IFIP - The International Federation for Information Processing

IFIP was founded in 1960 under the auspices of UNESCO, following the First World 
Computer Congress held in Paris the previous year. An umbrella organization for societies 
working in information processing, IFIP’s aim is two-fold: to support information processing 
within its member countries and to encourage technology transfer to developing nations. As 
its mission statement clearly states, 

IFIP’s mission is to be the leading, truly international, apolitical organization which
encourages and assists in the development, exploitation and application of information
technology for the benefit of all people.

IFIP is a non-profitmaking organization, run almost solely by 2500 volunteers. It operates 
through a number of technical committees, which organize events and publications. IFIP’s 
events range from an international congress to local seminars, but the most important are: 

• the IFIP World Computer Congress, held every second year; 

• open conferences; 

• working conferences. 

The flagship event is the IFIP World Computer Congress, at which both invited and 
contributed papers are presented. Contributed papers are rigorously refereed and the rejection 
rate is high. 

As with the Congress, participation in the open conferences is open to all and papers may 
be invited or submitted. Again, submitted papers are stringently refereed. 

The working conferences are structured differently. They are usually run by a working 
group and attendance is small and by invitation only. Their purpose is to create an atmosphere 
conducive to innovation and development. Refereeing is less rigorous and papers are 
subjected to extensive group discussion. 

Publications arising from IFIP events vary. The papers presented at the IFIP World 
Computer Congress and at open conferences are published as conference proceedings, while 
the results of the working conferences are often published as collections of selected and 
edited papers. 

Any national society whose primary activity is in information may apply to become a full 
member of IFIP, although full membership is restricted to one society per country. Full 
members are entitled to vote at the annual General Assembly. National societies preferring a
less committed involvement may apply for associate or corresponding membership. Associate
members enjoy the same benefits as full members, but without voting rights. Corresponding
members are not represented in IFIP bodies. Affiliated membership is open to non-national
societies, and individual and honorary membership schemes are also offered.

Preface



Under the theme of ‘The Millennium Push of Internet’, the eighth IFIP Conference
on High Performance Networking (HPN ’98) is taking place at the Vienna University
of Technology, Vienna, Austria, September 21 - 25, 1998. Its successful earlier
conferences were held in Aachen, Germany (1987), Liege, Belgium (1988), Berlin,
Germany (1991), Liege, Belgium (1992), Grenoble, France (1994), Palma de Mallorca,
Balearic Islands, Spain (1995) and White Plains, New York, USA (1997).
The conference series has been established as a forum where specialists from
industry, network operators, and academia can share their experiences of the newest
trends in high performance networking. High performance can thereby be
viewed as high speed, high throughput, low delay or a quality measure representing
characteristics such as high flexibility, high availability, high scalability, high
functionality or high security.

HPN '98 focuses on the many hot issues related to the revolutionary evolution
of Internet and Intranets. Today all forms of multimedia networking are starting to
play an important role. Designed as computer networks, the traditional interconnecting
networks were used to carry computer data only. Now, Internet and Intranets are
also used for Internet telephony, multimedia conferencing, remote lecturing,
distributed simulations, network games and many more real-time applications. Future
application areas, with even more sophisticated real-time requirements, might
include telemedicine, distributed workgroups, distance learning and telecommuting.
Multimedia networking faces many technical challenges like real-time data over
non-real-time networks, high data rates over limited network capacities, and
unpredictable availability of network capacity. Therefore, protocols for real-time
applications, routing, congestion and flow control, quality-of-service management,
interworking with ATM, high-performance end systems and applications, as well as
multicast protocols are focal topics of this conference.

The program of HPN ’98 consists of three invited papers highlighting main trends
in the Internet arena and 38 papers selected from 69 paper submissions based on
three reviews per paper. The papers are presented in eleven sessions during three
days. Two days with half-day tutorials precede the conference. Special thanks are
due to IFIP WG 6.4, to the members of the organizing and technical committees,
to the Vienna University of Technology, to the sponsors of the conference, to the
staff of the Institute of Communication Networks, and particularly to Mrs. Johanna
Pfeifer for her very successful and highly appreciated organizational support.



Harmen R. van As Conference Chair 



Vienna, September 1998

Part One



Broadband Internet Access 





Broadband access to the Internet - an 
overview 



H. Leopold 
Alcatel 

Scheydgasse 41, A-1211 Vienna, Austria,

Tel.: +431 27722 3551, Fax.: +431 27722 1172, 
E-mail: helmut.leopold@aut.alcatel.at 



Abstract 

The evolution of the modern telecommunication environment depends on three 
main factors: (i) the market demand for new services, (ii) the politically stimulated 
liberalisation of the telecommunication market, and (iii) the technological advances 
of new telecommunication technologies. The technological advances include 
getting the fibre closer to the subscribers, cheaper equipment costs and 
standardisation. 

Network operators are nowadays forced to generate, by offering new services, very
high profit in a very short period of time with limited investment. For this reason,
the access network in particular, and the technologies linked to it, are of increasing
importance for the economic success of an operator in a liberalised
telecommunication market.

This article describes the most distinguishing aspects of an access network, its 
positioning within a telecommunication system, and the most important 
technological developments in this field: Fibre In The Loop (FITL) systems, 
Digital Subscriber Line (xDSL) technologies on copper twisted pair and the Hybrid 
Fibre Coax (HFC) technology on CATV networks. 

Keywords 

Access network, fibre in the loop (FITL), xDSL, ADSL, HFC, CATV, cable 
telephony, cable data modem

1 INTRODUCTION

The evolution of the modern telecommunication environment depends on three 
main factors: 

• the technological advances of new telecommunication technologies, 

• the market demand for new services, and 

• the politically stimulated liberalisation of the telecom market. 

The liberalisation of the telecommunications market in particular has a tremendous
impact on business in this area. New actors like international network operators,
Cable TV (CATV) network operators, energy suppliers, railway organisations, city
communication operators, etc. will start to offer telecommunication services in
competition with the incumbent national network operator. This will have an impact
on market share, the tariff structure, the Quality of Service (QoS) and the
services offered to customers. The final target is the so-called “Full
Service Network (FSN)”, which is capable of offering all types of interactive
multimedia services at any time, in any place and with any QoS.

The provisioning of new telecommunication services in general and new 
multimedia services in particular is made possible by the availability of several 
new technologies as well as advances in standardisation: 

• Optical signal transmission and SDH technology, which allow the
establishment of powerful backbone networks.

• New switching and transmission technologies like ATM, which allow the
transport of information independent of its nature, and which allow the
dynamic establishment of high-bandwidth connections with flexible Quality of
Service (QoS) in order to meet the application requirements. However, the
recently initiated discussion concerning the integration of Internet and ATM
technologies will have an enormous impact on future developments in this
field (Sales, 1998).

• Low cost digital mass storage permits economical deployment of digital video 
servers. 

• Powerful video compression techniques like MPEG effectively reduce storage 
expense and transmission bandwidth demand. 

• New wireless technologies for access and transmission networks, based on
satellite or terrestrial systems.

• New access network technologies to bring multimedia capacity over the last 
mile to a broad range of subscribers. 

These technological advances are to a large extent also stimulated by the regulatory
changes around the world towards more open competition, liberalisation, and
privatisation. Operators worldwide are faced with this changing environment and
are now in the process of upgrading their network infrastructures in order to become
competitive in the new telecommunication environment.

The question now is how all these providers will transform their network
systems and their organisations. Network operators are forced to maximise
revenue in a very short period of time with limited investment. For this reason,
the access network in particular, and the technologies linked to it, are of increasing
importance for the economic success of an operator in the liberalised telecom
market. The access network represents an essential part of the overall costs of a
telecommunication infrastructure. Up to 70% of the investment cost is spent on the
access network, and the ownership of the access network, i.e. its Operation and
Maintenance (OAM), accounts for up to 80% of the total cost. OAM costs include
preventive and corrective maintenance, network management and repair of defective
equipment.

In the past years there has been an international effort from telecommunication 
operators and manufacturers to maximise consensus on the systems required in the 
local access network to deliver a full set of telecommunications services, both 
narrowband and broadband (FSAN, 1997). 

This article discusses the positioning of an access network in an overall
telecommunication system and presents the recent advances in access technologies.
Besides Fibre In The Loop (FITL) systems, the Digital Subscriber Line
(DSL) technologies for twisted-pair telephone lines and the Hybrid Fibre Coax (HFC)
technology for coaxial CATV networks are of main importance, since they allow
existing access network infrastructures to be used to offer broadband services
in addition.



2 EVOLUTION OF THE ACCESS NETWORK ARCHITECTURE 

The general trend in telecommunication networks is to establish powerful 
backbone networks which are composed of only a few network nodes. These nodes 
are interconnected by broadband connections and have an extensive access 
network, reducing the public switching hierarchy levels (Berkowitz, 1997). 

All network operators running fibre optic networks have established
such powerful backbone network infrastructures by using SDH, Frame Relay (FR)
and ATM technologies in order to enter the telecommunication business. Besides
the incumbent operators, such infrastructures are owned by new actors like
energy suppliers, railway organisations, city communication operators etc. It
is important to recognise that such an infrastructure is the basis for entering the
business market segment only, since no appropriate access network is
available to address the mass market, i.e. to offer new multimedia
telecommunication services to small and medium sized enterprises (SMEs) and
households.

Following the ITU-T terminology, an access network comprises those
network entities which provide the required capabilities for the provision of
telecommunication services between the local exchange (Lex) and the subscribers,
i.e. the related User Network Interface (UNI) 1.



1 According to ITU-T recommendation G.902 the access network is allocated between the so-called 
Service Node Interface (SNI) at the V reference point (at the Lex) and the User-Network Interface 
(UNI) at the T reference point at the subscriber premises (G.902, 1995). The VB5.1 interface provides 
access for multiple users of an access network to the service node. The VB5.2 interface has the

In particular, the liberalisation of the European telecommunication market on
January 1st, 1998 has resulted in much stronger competition between network
operators as well as manufacturers. Thus, the access network will also be
influenced by this new framework. New access network technologies have to be
able to offer traditional services very cheaply and, at the same time, to realise
new multimedia services at a reasonable price. The most important factors for the
design of new technologies for the access network are:

• Minimising the Operation and Maintenance (OAM) costs. 

• Improvement of the Quality of Service (QoS), for example by improving the
fault and service management capabilities.

• Offering of new services: Incumbent network operators are looking for new 
revenues and are trying to protect their existing customer base. New operators 
have the intention to gain market shares by offering new services. Thus, the 
new access network has to support existing narrowband services like 
Telephony, ISDN and 64 kbit/s to 2 Mbit/s leased line services, as well as new 
broadband services like fast Internet access and digital video transmission. 

Thus, new access networks have to be flexible in order to support the 
increasing bandwidth demand as well as any type of service. 



3 TODAY'S PSTN-ACCESS NETWORK 

The physical access infrastructure of the classical Public Switched Telephone
Network (PSTN) is based on copper twisted pair lines (a/b-lines), which connect
the subscribers to the nearest local exchange. In most industrial countries around
99 % of households are connected to the PSTN by copper twisted pairs, over which
the classical narrowband telecommunication services are offered. Only in rural areas
are wireless access systems installed as well. Today wireless access networks are
being investigated more generally as a means of offering mobility as an additional
service value to subscribers.

From a main distribution trunk within the local exchange (Lex), cables with some
400 to 600 copper wires are used to finally connect the subscribers with single
twisted pairs in a point-to-point star architecture. A typical distance from the Lex
to the subscriber is around 4 km; in rural areas, distances are of course longer.

The PSTN was designed to provide telephony services to the subscribers. The 
early users needed only the set-up of connections to different locations, so they 
could talk to each other. This is the Plain Old Telephone Service (POTS). 
Basically, the provision of POTS is still the rationale behind the telephone network
and operators still generate most of their income from POTS services.



additional capability of controlling the access network from the service node via a dedicated protocol in 
such a way that it is possible to have concentration within the access network on a per call basis.

A digital service offered today over the PSTN is realised by the Integrated
Services Digital Network (ISDN). ISDN offers a common interface for voice
and data services up to 2 Mbit/s.

Furthermore, the PSTN is today also used for different data services, like
access to the Internet, by using voice-band modems with capacities of up to 56
kbit/s through the 4 kHz voice channel.
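
As a rough plausibility check on the voice-band figure, the Shannon capacity
C = B log2(1 + SNR) of an ideal band-limited channel can be evaluated. The
following is only a sketch: the 4 kHz bandwidth is taken from the text, while the
SNR values and the function name are illustrative assumptions and not
measurements from this article.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon capacity C = B * log2(1 + SNR) of an ideal noisy channel."""
    snr_linear = 10 ** (snr_db / 10)          # convert dB to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


if __name__ == "__main__":
    # The SNR values below are assumptions chosen for illustration only.
    for snr_db in (30, 35, 40):
        capacity = shannon_capacity_bps(4_000, snr_db)
        print(f"SNR {snr_db} dB -> {capacity / 1000:.1f} kbit/s")
```

Under these assumptions the capacity of the voice channel stays in the range of
tens of kbit/s, which is why multi-Mbit/s access requires the use of frequencies
well above the voice band.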

As already mentioned above, network operators still generate most of their
income from POTS services. However, the need for new, more sophisticated
services based on Internet traffic will change the importance of the role of the
PSTN. This new situation will have a tremendous impact on the architecture and
technology of the new access network.



4 USE OF FIBRE IN THE ACCESS NETWORK 

Optical fibre in the access network is a prerequisite for providing the
bandwidth necessary for a final full service network. However, the key question is
how far into the access network it is economically viable to go.

Depending on the use of fibre in the access network, and thus the location of the
optical network termination, i.e. the location of the so-called Optical Network Unit
(ONU), we have the following network architectures:

• FTTH (Fibre To The Home): The fibre is terminated at the customer premises. 

• FTTB (Fibre To The Building): In this case the ONU is located in the 
basement of the served building, and may be up to 500 m from the customer 
apartment or office. 

• FTTC (Fibre To The Curb): In this case the ONU may be up to 500 m from
the customer premises. A longer distance between the customer and the ONU
requires much less fibre deployment. Such architectures are also called Fibre
To The Cabinet (FTTCa) or Fibre To The Node (FTTN), Fibre To The
Service Area (FTTSA) and Fibre To The Last Amplifier (FTTLA) in CATV
networks.

The trend is to bring fibre nearer and nearer to the home in order to increase the
capacity, improve the reliability and achieve very low error rates in the network
infrastructure. However, FTTH has proved very expensive, due to the high
cost of civil works and the low sharing of optical and electronic equipment
between customers.

The new installation of cables, whether fibre or copper, is - if no
appropriate duct is available - the largest cost factor in a network. FTTC/FTTB
offers a good compromise, taking advantage of the high bandwidth of fibre, while
utilising the existing asset of the local loop in the last kilometre. In FTTB/FTTC
network architectures copper twisted pair or coaxial cable is used to cover the last
section to the terminal equipment. The different technological solutions are
discussed below.

Optical networks support the transmission of analog as well as digital
signals. The network topology of an optical network is based either on a
point-to-point configuration or on a point-to-multipoint configuration, which might
result in a tree structure. Depending on whether the signal is electrically amplified
within the network, we have to differentiate between an Active Optical Network
(AON) and a Passive Optical Network (PON).

An optical network (AON or PON) has to be considered the most promising
approach to achieving large-scale full service access network deployment that
could meet the evolving service needs of network users. A PON is able to cover
some 20 km. This distance depends on the number of splitting points used, since
each passive optical splitter introduces some power loss, which limits the
distance. Within an AON the optical signal is amplified within the network,
which results in a much longer distance. In addition, the AON offers
concentration functions due to the switching capabilities at the splitting points.
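
The dependence of the PON reach on the splitting points can be illustrated with
a simple optical power budget. An ideal passive 1:N splitter divides the power
equally, which corresponds to a loss of 10 log10(N) dB. The sketch below is not
taken from the article: the power budget, the fibre attenuation, the excess loss
per splitter and the system margin are all hypothetical values chosen for
illustration.

```python
import math


def splitter_loss_db(split_ratio: int, excess_loss_db: float = 1.0) -> float:
    """Loss of a passive 1:N splitter: ideal 10*log10(N) plus an assumed excess loss."""
    if split_ratio < 1:
        raise ValueError("split_ratio must be >= 1")
    return 10 * math.log10(split_ratio) + excess_loss_db


def max_reach_km(budget_db: float, split_ratios: list,
                 fibre_loss_db_per_km: float = 0.35,
                 margin_db: float = 3.0) -> float:
    """Fibre length that remains after subtracting splitter losses and margin."""
    splitting = sum(splitter_loss_db(n) for n in split_ratios)
    remaining = budget_db - margin_db - splitting
    return max(0.0, remaining / fibre_loss_db_per_km)


if __name__ == "__main__":
    # 28 dB budget is an assumption; compare one and two splitting stages.
    for stages in ([32], [4, 8], [4, 8, 2]):
        print(stages, f"{max_reach_km(28.0, stages):.1f} km")
```

Each additional splitting stage consumes part of the budget, so the achievable
distance shrinks as the number of splitting points grows.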



5 ACCESS TECHNOLOGIES FOR MULTIMEDIA SERVICES 

5.1 The technology choices 

Depending on the physical media used for the access network, different
technologies will be employed (ETSI, 1995; Griffith, 1996):

• Twisted-pair copper lines using Digital Subscriber Line (xDSL) technologies 
like HDSL, ADSL, and VDSL; 

• Coaxial cable by using HFC (Hybrid Fibre Coax) technology, where 
especially the use of fibre in the access network plays an essential role; 

• FTTB/FTTC network architectures using copper twisted pair or coaxial cable
to cover the last section to the terminal equipment. A combination of
fibre/copper is usually called a FITL (Fibre In The Loop) system. A FITL
infrastructure is mainly used to offer narrowband services.

If the fibre is deployed all the way to the subscriber, Fibre To The Home (FTTH)
access is realised.

• B-ISDN fibre access lines 2. Since in such a scenario the fibre always runs
up to the subscriber, we get an FTTB or an FTTH architecture. Usually
Metropolitan Area Networks (MANs), Frame Relay or ATM networks are
connected to such access lines.

• Wireless systems, terrestrial or by satellite. 

5.2 Green field situation or use of the existing access infrastructure 

In the development of new access technologies, it is important to note that there
are practical and sensible compromises between cost, functions and performance.
The choice of the appropriate access technology is based on two factors:

• Services to be offered: narrowband or broadband services, services for 
households and SMEs or services for large business users. 



2 B-ISDN UNI specifications are available for 155 Mbit/s, 622 Mbit/s, 1.5 Mbit/s, 2 Mbit/s, 51 Mbit/s,
and 25.6 Mbit/s; see ITU-T recommendation I.432.1-5.

• Deployment of a new access infrastructure or reuse of an existing access
infrastructure. In a so-called “green field situation” there is no existing access
infrastructure available. However, such a situation is rarely the case in
industrial countries.

In a green field situation FITL or HFC networks are usually deployed. HFC
technology will be used if video broadcast is part of the offered service. A FITL
access infrastructure is appropriate if narrowband services like POTS, ISDN and
leased line services up to 2 Mbit/s are of main importance. If a quick
offering of narrowband services and mobility are of main importance, wireless
technologies are appropriate.

The owners of access infrastructures (copper or coax) are driven to capture
profits from new services. They may therefore try to invest as little as possible
and go for immediate results. In this approach, the linearity of the investment
against the possible return is important, all the more so given the
uncertainty of the market share for new services. Thus the reusability of the already
available access infrastructure is of main importance. For the reuse of an existing
physical infrastructure (coax cable or copper twisted pair), the following
technologies have to be considered:

• HFC-technology, if video broadcast services are part of the offered services. 

• FITL-technology if some part of the existing copper infrastructure has to be 
upgraded by fibre and the offered services are mainly POTS, ISDN and nx64 
kbit/s (up to 2 Mbit/s) leased line services. 

• xDSL-technologies, if broadband services to the subscriber have to be offered 
via the point-to-point twisted-pair copper lines. 

The different technological solutions are discussed in the following. 
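
The selection rules stated in this section can be summarised as a small decision
function. This is only one possible reading of the text: the class, the field
names and the ordering of the checks are hypothetical, and cases that the text
does not cover are reported as such rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessScenario:
    """Hypothetical description of a deployment case (names are illustrative)."""
    green_field: bool                 # True: no existing access infrastructure
    existing_medium: Optional[str]    # "copper", "coax", or None in a green field
    video_broadcast: bool             # video broadcast is part of the service offer
    narrowband_only: bool             # POTS, ISDN, leased lines up to 2 Mbit/s
    quick_rollout_and_mobility: bool = False


def choose_access_technology(s: AccessScenario) -> str:
    """Return the technology family suggested by the rules of section 5.2."""
    if s.video_broadcast:
        return "HFC"
    if s.green_field:
        if s.narrowband_only and s.quick_rollout_and_mobility:
            return "wireless"
        if s.narrowband_only:
            return "FITL"
        return "no rule stated"
    if s.existing_medium == "copper":
        # Narrowband services with partial fibre upgrade: FITL; broadband: xDSL.
        return "FITL" if s.narrowband_only else "xDSL"
    return "no rule stated"


if __name__ == "__main__":
    incumbent = AccessScenario(green_field=False, existing_medium="copper",
                               video_broadcast=False, narrowband_only=False)
    print(choose_access_technology(incumbent))
```

The explicit "no rule stated" result keeps the sketch honest about the
combinations, such as broadband without video in a green field, for which the
section gives no recommendation.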

5.3 Direct fibre access (FTTH/FTTB) and Fibre In The Loop (FITL)
systems

As highlighted above, direct broadband fibre access (FTTH/FTTB), which offers
an enormous capacity for multimedia services, is economically
feasible only for very large business customers, which require symmetrical
access and high capacity to a backbone network. However, recent standardisation
activities are aiming at developing very cost effective optical network terminations
(FSAN, 1997). Such a cheap optical network termination is the prerequisite for
realising any cost effective FTTH infrastructure in the future.

A FITL (Fibre In The Loop) system is a combination of fibre cables feeding 
neighbourhood Optical Network Units (ONUs) and last mile connections by 
existing or new copper wires. A FITL infrastructure is mainly used to offer 
narrowband services like POTS, ISDN and leased line services up to 2 Mbit/s. The 
Deutsche Bundespost Telekom project in the eastern part of Germany is a good 
example of a large FITL deployment, based on a PON architecture, in a green field 
situation (Hytas, 1995).

The access technologies which are based on the reuse of existing infrastructure
and which are the basis for offering multimedia services to a broad range of
subscribers, xDSL and HFC, are discussed in the following sections.



6 DIGITAL SUBSCRIBER LINE (DSL) TECHNOLOGIES 
6.1 xDSL Technologies: HDSL, ADSL, VDSL 

It is important to note that ISDN was the first technology to transport digital
signals at up to 2 Mbit/s over the twisted pair access. The newly available Digital
Subscriber Line (DSL) transmission techniques like HDSL (High bit rate Digital
Subscriber Line), ADSL (Asymmetric Digital Subscriber Line) and VDSL (Very
high speed Digital Subscriber Line) can deliver data at multi-Mbit/s rates over the
unscreened, twisted telephone wires originally intended for the frequency band
between 300 Hz and 3.4 kHz. This is due to the remarkable advances in digital signal
processing technology. These allow the implementation of sophisticated channel
modulation techniques, which suggests that there are no fundamental technological
barriers to overcome, at least on the digital side. In fact the challenges are mainly
of an analog nature, set by the external electrical environment, i.e. the imposed
noise interference which will degrade the Signal-to-Noise Ratio (SNR).

For the development of these new technologies, emphasis is placed not only on
the maximum possible bit rate, but also on the robustness of the physical transport
medium as well as on different service characteristics. For that reason, different
modulation techniques are used throughout the solutions.

HDSL (High bit rate Digital Subscriber Line)

HDSL realises the transmission of 2 Mbit/s over a copper twisted pair in both
directions (upstream and downstream) and usually offers a standardised G.703
interface to the subscriber. To increase the maximum possible distance for the
transmission of the 2 Mbit/s, 2 or 3 copper twisted pairs can be used in parallel.
Since the bandwidth capacity is the same in both directions, this is also called an
SDSL (Symmetric Digital Subscriber Line) technology. HDSL is becoming a well
accepted technology for offering 2 Mbit/s leased line services, especially to
business customers. The use of HDSL technology for the realisation of a multimedia
application is described in (Leopold, 1996).

ADSL (Asymmetric Digital Subscriber Line) 

An ADSL system connects an ADSL modem pair through a twisted pair copper 
line, creating a high bit rate downstream channel and a medium bit rate upstream 
channel (Chen, 1994). The high bit rate downstream channel ranges from 1.5 to 7.5
Mbit/s, while the upstream channel ranges from 16 to 640 kbit/s. This is achieved
without disturbing the POTS service already installed on the line. The POTS 



3 Also 64 kbit/s leased line services are technically feasible; however there is no clear market 
requirement for such a service up to now.

compatibility is achieved by using a higher frequency band for the digital signals
than that used for the analog telephone signal (i.e. above 4 kHz up to 1 MHz). The
analog signals are separated from the digital signals by a so-called “POTS splitter”.

The downstream high-speed channel is based on the assumption that most 
residential high-speed services will be asymmetric. The business users requiring 
symmetric services will install fibre for bi-directional data transfer. The 
downstream data rates depend on a number of factors, including the length of the 
copper line, its wire gauge and cross-coupled interference. Line attenuation 
increases with line length and frequency and decreases as wire diameter increases. 

VDSL (Very high speed Digital Subscriber Line)

VDSL is a complementary development of the ADSL technology. With VDSL
a much higher bandwidth is achieved over a much shorter distance to the customer:
up to 25 Mbit/s in the downstream direction and up to 2 Mbit/s in the upstream
direction at distances of up to 1 km. This dimensioning makes VDSL technology a good
extension of an FTTCa architecture. However, VDSL is still in the definition phase
and further developments have to be expected.

ISDL (ISDN Digital Subscriber Line) 

A further xDSL variant is to integrate ISDN and ADSL. However, such an ISDN
Digital Subscriber Line (ISDL) approach is only useful for the integration with an
ISDN basic access. The ISDN primary access (2 Mbit/s) is usually used to connect
PABXs with the PSTN. The ADSL technology has a completely different objective:
to bring multimedia capacity to the subscriber (i.e. up to 6 Mbit/s). However,
standards are not yet finalised, and the market acceptance of such an integration
has to be verified very carefully.

ADSL-Lite 

In order to improve market acceptance, a special ADSL variant has been
developed. ADSL-Lite has the objective of reducing the installation effort at the
subscriber premises by allowing the subscriber simply to plug the modem into any
wall outlet in the home, just as with the usual base-band modems. This is achieved by
eliminating the POTS splitter, but also results in a compromise concerning
ADSL performance.

6.2 Modulation Techniques 

In fact, modern coding and modulation techniques provide a level of performance
that approaches the theoretical limit of the physical bandwidth. New ADSL
developments use Discrete Multi-Tone (DMT) modulation techniques. DMT
is a form of multicarrier modulation. In the frequency domain, DMT divides the
channel into a large number of sub-channels. Only those sub-channels which are
not disturbed will be used. These sub-channels are identified during the
initialisation phase and are continually re-checked during operation in
order to adapt to a changing environment (i.e. DMT is highly robust in noisy
environments). This technique is called rate adaptive ADSL (RADSL).

Another technique, the so-called Carrierless Amplitude Phase (CAP) modulation
technique, is under discussion within standardisation as well. However, DMT is an
accepted technology and thus standardised within ANSI T1E1.4.
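
The way DMT selects and loads its sub-channels can be sketched with the widely
used gap approximation, in which a sub-channel with signal-to-noise ratio SNR
carries floor(log2(1 + SNR / gap)) bits per symbol. The sketch below is an
illustration and not the procedure of any particular product: the gap of 12 dB,
the limits of 2 to 15 bits per sub-channel and the symbol rate of 4000 symbols
per second are assumed parameters, and the function names are hypothetical.

```python
import math


def dmt_bit_loading(snr_db_per_tone, gap_db=12.0, min_bits=2, max_bits=15):
    """Assign bits to each sub-channel; disturbed sub-channels get 0 bits."""
    gap = 10 ** (gap_db / 10)
    bits = []
    for snr_db in snr_db_per_tone:
        snr = 10 ** (snr_db / 10)
        b = min(math.floor(math.log2(1 + snr / gap)), max_bits)
        # Sub-channels that cannot carry min_bits are switched off entirely.
        bits.append(b if b >= min_bits else 0)
    return bits


def line_rate_bps(bits_per_tone, symbol_rate_hz=4000):
    """Aggregate bit rate: total bits per DMT symbol times the symbol rate."""
    return sum(bits_per_tone) * symbol_rate_hz


if __name__ == "__main__":
    measured = [10, 20, 40, 90]            # assumed per-tone SNR values in dB
    loading = dmt_bit_loading(measured)
    print(loading, line_rate_bps(loading), "bit/s")
```

Rate adaptation then amounts to repeating the measurement and the loading step
during operation, so that the bit allocation follows the changing noise
environment.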

6.3 ADSL and ATM Integration 

Modern ADSL products integrate ATM and ADSL technology. The main
advantages of such an architecture in the access network are the following:

• Support of different traffic characteristics (continuous bit streams, data bursts, 
etc.); 

• Any granularity of bit rate for different user channels. 

• Multiple QoS levels per subscriber (ATM connections), allowing a service 
range from low cost residential to premium business at a guaranteed quality 
level. 

• Combining DMT and ATM provides a flexible bit rate, exploiting the 
maximum line transfer capacity. 

• The offering of different interface types. Besides Ethernet for the traditional
connectivity of PCs and Network Computers (NCs), the ATM Forum 25.6
Mbit/s interface has been established as a new broadband standard interface
at the user's premises.

6.4 ADSL System Architecture 

The ADSL access network encompasses the ADSL modems and the access
multiplexer system at the local exchange (Lex) and the ADSL modems at the customer
premises, connected via the local loop. The ADSL modem at the Lex side is also
called the ADSL-LT (line termination) and the modem at the subscriber premises
is called ADSL-NT (network termination).

The access multiplexer system and the modems at the Lex side are usually 
combined into a single unit called the access node (using the ADSL Forum 
terminology) and also referred to as the “DSLAM” (DSL Access Multiplexer) or 
Subscriber Access Multiplexer (SAM). When the backbone network is ATM, the 
access node is connected to an ATM access switch. The ADSL access node and 
ATM access switch may or may not be co-located. The function of the ATM 
access switch is to concentrate and switch traffic from a number of access nodes 
onto the regional broadband network. The ADSL access node (DSLAM) performs 
the following functions: 

• Line Termination (LT) of the ADSL subscriber lines. 

• Concentration/multiplexing of the ADSL subscriber lines towards the
broadband network. WAN interfaces such as an STM-1 are expensive
resources for an operator. It is therefore important to concentrate as many
subscriber lines as possible onto a single network interface. A multiplexing
scheme that provides high concentration while guaranteeing the individually
negotiated QoS will be an important asset for network operators, because it
will allow them to offer differentiated services at a reasonable cost.

• Termination of customer ATM signalling channels. To provide a standard,
scalable mechanism for supporting switched virtual circuit service to ADSL
customers, the ADSL access node should terminate the ATM signalling
protocol from each ADSL customer, and generate a single standard UNI
signalling interface toward the access ATM switch.

At the subscriber premises the installation of a network termination (ADSL-NT) is 
required to which a LAN, a PC, a Network Computer (NC) or a TV set with an 
appropriate Set Top Box (STB) is connected. 

As shown in Figure 1, the subscriber now has the usual PSTN access for POTS
services, access to the Internet and on-line services via an Internet Service
Provider (ISP), as well as access to digital video content.




Figure 1. ADSL network architecture 



6.5 The business opportunities of xDSL Technologies 

To summarise, the business opportunities of xDSL-Technologies are: 

• Use of the existing PSTN access network, resulting in low installation costs,
which allows the quick connection of new users and early service
deployment.

• Costs are strictly proportional to the number of connected customers. Thus
only a low upfront investment is necessary, which limits the
financial risk of the operator.

• Sufficient capacity for multimedia services with no impact on the telephony
service on the same line.

• Bandwidth asymmetry is well suited for typical client/server applications. 

7 CATV-OPERATORS AND HYBRID FIBRE COAX (HFC) ACCESS
NETWORKS

7.1 The CATV business

Basically, the broadcast TV business is “entertainment”. Thus, the customer is
not really willing to pay for the transmission of signals, but he is prepared to
pay for the content he receives. Compared to the traditional telecom business,
revenues other than transmission tariffs are necessary. Thus the prime revenue
streams are based on advertising, licenses, basic cable fees, and pay TV.
However, it has to be noted that the traditional TV business is a low margin
business with a long return on investment.

It seems obvious that CATV operators first of all have to protect their core
business against Direct Broadcast Satellite (DBS) services, especially as
stimulated by the new digital TV services. At the same time they have to explore
new business areas like data communications, Internet access and on-line
services, and to look for new market shares in the classical telecommunication
business, in order to ensure growing revenues in the long term.

However, the main problem that CATV operators have to consider is the change in
mindset needed to take advantage of the new market opportunities. Entering the
new telecommunication business means more than just a technological upgrade
of the network infrastructure. It also means the definition of new services and,
furthermore, establishing relationships with customers through marketing,
customer service, etc. CATV operators have to learn to think more like
telecommunications companies. Such a need for cultural change is a potentially
critical factor for CATV operators.

7.2 The new business opportunity of a HFC network 

Traditional CATV networks make extensive use of coaxial copper cable to carry
the signal from the head-end (HE) to the subscriber. Along the path, amplifiers
are deployed to compensate for the attenuation of the long coax cable lines.
Each broadcast TV signal takes 6 MHz for an NTSC signal or 8 MHz for a PAL
signal here in Europe. The bandwidth available on these pure coax networks
usually ranges from 300 to 500 MHz.
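
As a rough worked example of these figures, the following Python sketch computes
an upper bound on the number of analogue TV channels that fit into a pure coax
network. It assumes, for simplicity, that the whole quoted bandwidth is usable
for TV carriers; real channel plans reserve parts of the spectrum, so actual
counts are lower. The function name max_tv_channels is hypothetical.

```python
def max_tv_channels(bandwidth_mhz, channel_mhz):
    """Upper bound on the number of TV channels, assuming (for illustration)
    that the complete bandwidth can be filled with TV carriers."""
    return int(bandwidth_mhz // channel_mhz)


# PAL (8 MHz per channel) and NTSC (6 MHz per channel)
pal_low, pal_high = max_tv_channels(300, 8), max_tv_channels(500, 8)    # 37, 62
ntsc_low, ntsc_high = max_tv_channels(300, 6), max_tv_channels(500, 6)  # 50, 83
```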

During the 1980s and 1990s cable operators began upgrading their networks to a
Hybrid Fibre Coax (HFC) architecture to provide higher quality, increased
programming, and new services. These new networks combine high capacity fibre
lines with inexpensive coax lines and are the basis for establishing
bi-directional capabilities.

HFC based networks have Broadband Optical Network Terminations (BONTs),
each of which typically serves some hundred to some thousand homes by coax
cable; this is the serving area of a so-called “coax cell”. 500 homes per coax
cell are seen as the optimised configuration to offer the whole set of possible
services, from analogue broadcast TV to professional telecommunication services.
Such a modern HFC network is able to offer, for example, the following set of
services:

• 40 analogue TV broadcast channels, 

• 200 broadcast digital TV channels, 

• 400 on-demand channels (Near Video On Demand (NVOD), etc.),

• Internet access via the TV set, 

• Internet access and high-speed digital data services via LANs/PCs, and voice 
telephony services. 

Moving to become full service providers, offering digital TV, data
communications, Internet access, on-line services and telephony services, allows
CATV operators to grow their business by

• protecting their core business against the DBS competition, 

• entering new business areas, 

• competing with the traditional telecom operators, and 

• offering access to the “last mile” to other network operators which do not
have direct access to the customers.

7.3 Services to be offered on CATV networks 

There are four service classes to be offered on CATV networks. Figure 2 shows 
the different functional blocks at the subscriber premises 4 : 

Analogue TV broadcast 

This is the traditional means of transmitting analogue TV signals to subscribers.
Analogue audio distribution and information services via Teletext are well
accepted by the subscribers. The Teletext service is based on the
distribution of information (text and graphics) in a broadcasting communication
mode without any user interactivity over the network.

Based on these broadcast services, either analogue or digital, a first level of
interactivity is realised basically by using the PSTN POTS service. Thus new
services are offered, such as Home Order Television (HOT), which allows so-called
“impulse tele-shopping” applications, and information on demand services based
on Teletext.

Pay-TV, based on a simple scrambling of the analogue TV signals, is one
of the new services being offered by many CATV operators. For transportation of
the control information from the subscriber to the CATV operator, usually the
POTS service of the PSTN is used. For such a Pay-TV service, the customer
requires an analogue STB in order to decode the scrambled TV signal 5 .



4 The functional modules do not necessarily relate to implementations; i.e. several functional modules 
might be combined within one implementation in the future. 

5 E.g. the Pay-TV channel Premiere.

Digital TV and interactive multimedia services

For digital TV broadcast based on the Digital Video Broadcast (DVB) standards
and for interactive multimedia services based on the Digital Audio-Visual Council
(DAVIC) specifications, the subscriber requires a DVB compliant digital STB in
order to decode the digital MPEG-2 encoded video streams. Digital TV broadcast
(audio and video) basically offers the same services as analogue signal
transmission. However, digital transmission achieves a much higher capacity
in the network. Depending on the coding scheme and the bandwidth used
for the digital video signal 6 , this allows the transmission of
approximately 8 digital channels per 8 MHz 7 . Thus Near Video on Demand
(NVOD) and the distribution of selected programs to a specific set of customers
are the major services planned. For such services, no interactivity over the
network is required.
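
The figure of roughly 8 digital channels per 8 MHz can be reproduced with a
short Python calculation. It takes the 32 Mbit/s per 8 MHz channel from
footnote 7 and the video rates of 2, 4, 6 and 8 Mbit/s from footnote 6. As a
simplifying assumption, audio, service information and multiplexing overhead
are ignored; the constant and function names are hypothetical.

```python
CHANNEL_PAYLOAD_MBIT_S = 32  # per 8 MHz channel, value taken from footnote 7


def programmes_per_channel(video_rate_mbit_s):
    """Whole digital TV programmes that fit into one 8 MHz channel,
    ignoring audio and multiplexing overhead (simplifying assumption)."""
    return CHANNEL_PAYLOAD_MBIT_S // video_rate_mbit_s


# Video rates from footnote 6 mapped to programmes per 8 MHz channel
table = {rate: programmes_per_channel(rate) for rate in (2, 4, 6, 8)}
# {2: 16, 4: 8, 6: 5, 8: 4}
```

The value of 8 programmes corresponds to a video rate of 4 Mbit/s.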

If the CATV network is offering bi-directional functionality, the communication 
from the customer to the CATV operator is done via the CATV network, otherwise 
the use of the PSTN is possible as well. 

Based on such interactive services, Pay per View (PPV), Internet access, on-line 
services and interactive multimedia services can be offered to the subscriber 
(Furth, 1995). 

Cable telephony services 

Real interactivity based on a return channel within the CATV infrastructure is
necessary for the offering of telecommunication services like POTS, ISDN and 64
kbit/s to 2 Mbit/s leased line services. These services can be seen as “classical
telecom services”, offering guaranteed QoS, which are implemented over the
CATV infrastructure by a so-called “cable telephony” system (Fletcher, 1997). All
services are based on 64 kbit/s channels. At the subscriber premises a so-called
Coax Network Termination (CNT) is installed to offer the different interfaces for
POTS, ISDN and the n x 64 kbit/s leased line services. The bandwidth capacity is
usually below 2 Mbit/s in both directions. This service class allows the CATV
operator to get into the lucrative telephony business.

Data communication services 

This service class is based on offering LAN interfaces such as Ethernet at the
subscriber premises, in order to allow Internet access and on-line services by
the use of a PC, and LAN interconnection to realise intranetworks. This is
implemented by a so-called “cable data modem” system.

It is important to note that the users of a cable data modem system are using a
shared medium (as in a legacy LAN), which means that the offered bandwidth, of a
10 Mbit/s Ethernet for example, has to be shared by all users attached to this
LAN (Laubach, 1996). As in a legacy LAN environment, there are no QoS
guarantees provided by the network.
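
The consequence of the shared medium can be made concrete with a small Python
sketch. It assumes, purely for illustration, that the 10 Mbit/s are divided
equally among the simultaneously active users and that protocol overhead is
ignored; a real MAC protocol gives no such guarantee. The function name
fair_share_mbit_s is hypothetical.

```python
def fair_share_mbit_s(link_rate_mbit_s, active_users):
    """Average rate per user under an idealised equal split of the link."""
    if active_users < 1:
        raise ValueError("at least one active user is required")
    return link_rate_mbit_s / active_users


alone = fair_share_mbit_s(10, 1)     # 10.0 Mbit/s
ten = fair_share_mbit_s(10, 10)      # 1.0 Mbit/s
fifty = fair_share_mbit_s(10, 50)    # 0.2 Mbit/s
```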



6 The transmission bandwidth required (e.g. 2, 4, 6 or 8 Mbit/s), depends on the intended video quality. 

7 By the use of QAM-64 modulation, 32 Mbit/s can be transported within an 8 MHz channel.

During the last few years much effort has been made by international
organisations and companies to build suitable physical layer and medium access
control (MAC) protocols for interactive services using the CATV network for data
communications, and the development is still ongoing. Within the European
Digital Video Broadcasting (DVB) Project, the Return-Channel Working Group has
developed a complete specification for a physical and MAC protocol in
cooperation with the Digital Audio-Visual Council (DAVIC).

In North America several large cable companies have joined forces to form
Multimedia Cable Network Systems (MCNS). MCNS has set up the Data Over
Cable Service Interface Specification (DOCSIS). The IEEE has created the working
group 802.14, which is also developing a standard for a physical and a MAC
protocol. MCNS is focusing on the pure Internet environment and is well
advanced, while the DVB/DAVIC and IEEE activities are focusing on the use of
ATM. Which standard will finally succeed on the market is still not clear.




Figure 2: Services over the CATV infrastructure 

ATM on HFC Networks

In order to support the different traffic characteristics by a flexible
communication architecture, the use of ATM in HFC networks is being investigated
within the EU ACTS project ATHOC. For more details refer to (Bottle, 1997a) and
(Bottle, 1997b).

7.4 Evolution of CATV to modern Telecommunication Networks: HFC

To allow the realisation of new services, CATV network infrastructures have to be
upgraded in order to provide new capabilities for the subscribers. The starting
point for the CATV operators is the existing coax architecture. Purely coaxial
CATV networks evolve to Hybrid Fibre Coax (HFC) networks by implementing

• a bi-directional network, with capacity for interactive services, 

• new electrical amplifiers for the coax cable infrastructure, working up to 800 
MHz to offer enough capacity for all services, and 

• a fibre overlay network to improve the reliability and to bring the transmission 
capacity as near as possible to the customer. 

HFC - Physical Network Infrastructure 

The Hybrid Fibre Coaxial (HFC) architecture has become a standard for CATV 
operators. This architecture does not contain switching elements in the distribution 
network, and only requires optical-to-electrical conversion, amplification and 
power splitting. Thus every customer receives the same signal, which contains the 
information of all the services provided. 

A typical HFC network has a fibre star point-to-multipoint subnetwork and a 
tree-and-branch coaxial subnetwork. The Head-End (HE) receives, modulates and 
transmits the CATV channels over the fibre network towards the Broadband 
Optical Network Termination (BONT), where the signals are converted back to 
electrical signals to serve the customers by coaxial cable. Thus, from a physical
point of view, an HFC network is made of two parts:

• An optical network, which runs from the optical transmitter at the network
head-end to a point close to the customer, in a star or multiple star
architecture. The fibre part of the HFC network is a Passive Optical Network
(PON) or an Active Optical Network (AON), usually with two fibres, one for
each direction of transmission (also called a “two fibre system”). Using optical
fibre to a large extent instead of the coax infrastructure additionally
improves reliability and reduces maintenance expenses.

• A coaxial branch-and-tree network, which connects the termination of the
optical network (BONT) with each customer. The BONT performs the
transition between multiple main coax cables and the optical fibre. If the
serving area of the coax infrastructure is up to 250 m only, no amplifiers are
used. This is the recommended approach for green field situations.

Typically, some intermediate points between the head-end and the BONTs have
been introduced which allow a hierarchical structuring of the fibre distribution
network: the so-called Distribution Hubs (DH) (see Figure 3). These nodes perform
optical signal amplification and optical signal splitting. They can be seen
as equivalent to the local exchanges of the telephone networks and might be the
points where new equipment for the transmission of new services will be placed
(further obvious locations for the new equipment are the HE as well as the
BONT 8 ).




Figure 3. HFC network architecture: HE, DH, BONTs, coax amplifiers (LE) 

Signal transmission 

An HFC network transports the information (analogue and digital signals) by using
frequency division multiplexed (FDM) analogue carriers. Each TV channel is
modulated onto a different Radio Frequency (RF) carrier, with 8 MHz allocated per
analogue TV channel (European standard; PAL).

In a traditional CATV network, transmission is usually uni-directional only,
from the network to the user (downstream), in a broadcast communication mode
(since the very first usage of this infrastructure was analogue video
distribution). This limitation is usually due to the uni-directional amplifiers
within the coax network. These amplifiers work in the frequency range up to
302, 446, 600 or 860 MHz. After upgrading or replacing all the uni-directional
coax amplifiers, the infrastructure is able to transport signals
bi-directionally. However, the signal transmission is not symmetrical.

The upper part of the spectrum is used for downstream transmission and the
signal is distributed to all subscribers in a broadcast communication mode.
Encryption and/or other measures to control the access to the information have to
be used for services other than broadcast TV to ensure appropriate security. The
lower part of the spectrum is used for upstream transmission. A dedicated
protocol is required for sharing the available capacity (see below).

8 Although usually there is no space for additional equipment available at the BONT location.

Upstream signal transmission (return channel)

The return channel employed will have significant impact on the services offered 
and their degree of interactivity. Although it is not imperative to implement the 
return channel in the same network as the downstream channels, doing so leads to 
a homogeneous and user friendly solution. 

Two kinds of upstream signals must be transmitted. The first are network
supervision signals, alarms, etc. coming from coax amplifiers, BONTs, DHs or
other network elements (depending on the service implemented on the HFC
network), which allow the HFC part to be integrated into a powerful network
management system. The second are signals for the interactive services, such as
voice channels for telephony services, upstream channels for data services and
return channels for interactive multimedia services.

In the CATV network the physical medium - the coaxial cable - is shared by all 
subscribers. Therefore, an access procedure is required which provides collision 
free access to the shared medium and assigns capacity according to the customer’s 
requirements. Time division multiple access (TDMA) and frequency division 
multiplex (FDM) techniques are used to allow the return path to be shared among 
the subscribers. 

The return channel of the CATV network has some impairments which have to be
considered for the final system design:

• Limited bandwidth (some 50 MHz only) and modulation with low efficiency
due to the usage of a robust modulation technique (a typical spectral efficiency
of 1-2 bit/s/Hz).

• Noise accumulation in the upstream direction from each customer termination
within a coax cell. Common sources of external interference are radio signals
coupled into the network, engine based interference, TV receivers, etc.
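
To give a feeling for the first limitation, the following Python sketch
estimates the aggregate upstream capacity of one coax cell and the resulting
average rate per home. It combines the 50 MHz and 1-2 bit/s/Hz quoted above
with the 500 homes per coax cell mentioned in section 7.2. Two simplifying
assumptions are made: the full 50 MHz is usable for payload, and all homes
transmit at the same time. The function names are hypothetical.

```python
def upstream_capacity_mbit_s(bandwidth_mhz, efficiency_bit_s_hz):
    """Aggregate upstream rate: MHz times bit/s/Hz gives Mbit/s."""
    return bandwidth_mhz * efficiency_bit_s_hz


def per_home_mbit_s(capacity_mbit_s, homes):
    """Average share per home if all homes are active at once (assumption)."""
    return capacity_mbit_s / homes


low = upstream_capacity_mbit_s(50, 1)    # 50 Mbit/s
high = upstream_capacity_mbit_s(50, 2)   # 100 Mbit/s
share_low = per_home_mbit_s(low, 500)    # 0.1 Mbit/s
share_high = per_home_mbit_s(high, 500)  # 0.2 Mbit/s
```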

Downstream signal transmission 

The downstream channel has a distribution architecture. No noise accumulation
from different terminations occurs. For analogue signal transmission, the
same modulation technique as for terrestrial TV and FM radio services is used for
compatibility reasons. For digital services (telephony, data communications,
digital TV), a modulation technique providing high spectrum efficiency can be
used. Typical examples are 16-, 32-, 64-, 128- and 256-QAM. These modulation
techniques provide a spectrum efficiency of up to 6 bit/s/Hz, thus resulting in
several tens of Mbit/s on an 8 MHz channel 9 .

According to the current standardisation discussions, 41 Mbit/s can be
transmitted by means of 64-QAM in an 8 MHz wide CATV channel. Considering
the redundancy needed for error protection by a (204,188) Reed-Solomon code, 38
Mbit/s remain for the transmission of user data.
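
The step from 41 Mbit/s to 38 Mbit/s follows from the code rate of the
Reed-Solomon code: of every 204 transmitted bytes, 188 carry user data. A short
Python check, which deliberately ignores any further framing overhead, is given
below; the variable names are illustrative.

```python
import math

GROSS_RATE_MBIT_S = 41.0   # 64-QAM in an 8 MHz channel, as quoted in the text
RS_N, RS_K = 204, 188      # Reed-Solomon block length and payload length

# 64-QAM carries log2(64) = 6 bits per symbol
bits_per_symbol = int(math.log2(64))

# Implied symbol rate, about 6.8 Msymbol/s within the 8 MHz channel
symbol_rate_msym_s = GROSS_RATE_MBIT_S / bits_per_symbol

# Net rate after removing the Reed-Solomon redundancy, about 37.8 Mbit/s
net_rate_mbit_s = GROSS_RATE_MBIT_S * RS_K / RS_N
```

Rounded to the nearest integer, the net rate gives the 38 Mbit/s stated above.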



9 Capacity in an 8 MHz channel: 16-QAM: 25 Mbps; 32-QAM: 32 Mbps; 64-QAM: 38 Mbps;
128-QAM: 45 Mbps; it has to be noted, however, that these figures vary between
different product implementations.

8 CONCLUSION

Based on the studies conducted by the FSAN (FSAN, 1997), it was determined that 
the per-line cost of producing a full service access network will slowly decrease 
with volume of production. However, at a sufficiently high volume level the 
development of new technologies becomes justified that can enable significant 
reductions in per-line equipment and installation costs. 

However, no one knows for certain what path is socially optimal for residential 
adoption of broadband technology. Whatever the predictions, business decisions 
must be made in both the public and private sectors to determine the next step. 
Telcos and CATV operators have to continue to upgrade their networks both to
achieve network efficiency and to compete in the near term for new and advanced
communication services (Potts, 1997). New operators in particular have to
consider:

• The installation of a wireless access system suffers from limited capacity,
which currently does not support multimedia applications.

• A new installation of an access infrastructure might be too expensive and thus
economically not feasible.

• The use of the copper lines of the PSTN access infrastructure of the
incumbent operator (“unbundling”) depends on the interconnection fees.

• The incumbent operators have a potentially powerful access infrastructure
through the use of xDSL technologies.

• The use of the available CATV infrastructure might be an economical 
solution. 

The access network has a very essential role within a telecommunication 
infrastructure, and thus it brings an added value which is much more than just the 
transmission of information between the subscriber and the backbone network. 
Next steps should aim at demonstrating the availability of technologies and 
products for broadband access networks realising full service capabilities. This 
should also contribute to the effort in standardisation to enhance the definition for 
further system developments. 

Finally, it is important to note that only at a sufficiently high volume level
does the development of new technologies become justified. Thus a powerful access
network infrastructure which supports multimedia group communication services
for a broad range of users (i.e. households, SMEs and business units) will be the
prerequisite for any mass deployment of future multimedia services.







9 REFERENCES 

Berkowitz, P. (1997) The Changing Shape of the Access Network. 
TELECOMMUNICATIONS, Vol. 31, No. 10, October 1997 (pp 109-114). 

Bottle, D. (1997a) Network Architecture for an ATM based Hybrid Fibre Coax
System and related Applications. CWAS'97 International Workshop on
Copper Wire Access Systems, Budapest, Hungary, October 27-29, 1997.

Bottle, D. and Wahl, S. and Sierens, Chr. and Bastos, J. and Borges, I. and
Frei, P. and Christ, P. and Fahner, H. and Ramlot, G. (1997b) ATM Applications
over Hybrid Fibre Coax Trials. ISS'97, Toronto, September 21-26, 1997.

Chen, W.Y. and Waring, D.L. (1994) Applicability of ADSL to Support Video 
Dial Tone in the Copper Loop. IEEE Communications Magazine, May 1994 
(pp. 102-109). 

ETSI (1995) Video on Demand Network Aspects. ETSI/NA5, DTR/NA-52109, 
Technical Report, October 1995. 

Fletcher, M. (1997) Cable Telephony: Coming to Market.
TELECOMMUNICATIONS, Vol. 31, No. 9, September 1997 (pp 81-83).

FSAN (1997) Full Service Access Network (FSAN) Gx Initiative, summary of 
results, 25 February, 1997. 

Furth, B. and Kalra, D. and Kitson, F.L. and Rodriguez, A.A. and Wall, W.E.
(1995) Design Issues for Interactive Television Systems. COMPUTER, May
1995.

G.902 (1995) Access Networks - Architectures, Services, Arrangement and 
Service Node Aspects. ITU-T Recommendation, COM 13 -R41, July 1995. 

Griffith, M. and Guirao, F. and Van Noorden, L. (1996) Network Evolution for
residential Broadband Interactive Services - From RACE to ACTS. European
Conference on Multimedia Applications, Services and Techniques (ECMAST
'96), Proceedings Part I, May 1996.

HYTAS (1995) HYTAS: Ein zukunftsorientierter Netzzugang für
Multimediadienste. Vortrag zur 36. Post- und Fernmeldetechnischen
Fachtagung des VDPI “Multimedia - anbieten, transportieren, anwenden”, ke
Kommunikations-Elektronik GmbH & Co, Februar 1995.

Laubach, M. (1996) To foster residential area broadband internet technology: IP 
datagrams keep going, and going, and going .... Computer Communications 
19, 1996 (pp. 867-875). 

Leopold, H. and Hirn, R. (1996) The Bookshop Project: An Interactive Multimedia
Application Case Study. In Proc. of the International COST237 Workshop on
Multimedia Transport and Teleservices, D. Hutchison, A. Danthine, H.
Leopold, G. Coulson (eds), Barcelona, Spain, November 1996 (Springer
Verlag LNCS 1185, ISBN 3-540-62096-6).

Potts, M. (1997) Guideline NIG-G1: Broadband Deployment. ACTS, Network
Integration Chain Group, September 1997 (http://ginaiihe.ac.be/).

Sales, B. and Dumortier, P. and Van Mieghem, P. (1998) Dual-Mode Routing: A
long term Strategy for IP over ATM. 6th IEE International Conference on
Telecommunications, Edinburgh, 29.3-1.4.1998.







10 BIOGRAPHY 

Helmut Leopold, born April 27th, 1963, in Hohenems, Austria, received his degree
in Electronics and Communications from the Technical College in Rankweil,
Austria, in 1982. In 1989 he received the degree of Dipl.-Ing. Informatik
(Computer Science) from the University of Technology of Vienna. From 1989 to
1994 he was responsible for the group “Multimedia Communications” at the Alcatel
Austria Research Center in Vienna and was actively involved in international
standardisation and in European R&D projects in the broadband communication
area. Since 1994, Mr. Leopold has been with Alcatel Austria AG, extending his
activities to the marketing and usage of new multimedia services based on
broadband technologies. Since 1996 he has been account manager for the CATV
market in Austria.




Performance of multiple access 
protocols in geo-stationary 
satellite systems 



Heba Koraitim†, Samir Tohme†
Marouane Berrada‡, Americo Brajal‡
† Ecole Nationale Superieure des Telecommunications
46, rue Barrault - 75634 Paris - France,
Tel.: 33 1 45817449, Fax.: 33 1 45891664
e-mail: koraitim, tohme@res.enst.fr
‡ Laboratoires d’Electronique Philips S.A.S.
22, avenue Descartes - BP 15 - 94463 Limeil Brevannes Cedex France,
Tel.: 33 1 45106700, Fax.: 33 1 45106960
e-mail: berrada, brajal@lep.research.philips.com



Abstract 

Two packet multiple access schemes, DQRAP and ARRA, are modeled and
evaluated in the geostationary satellite environment. A new protocol is then
proposed which combines the advantages of both schemes and is better
adapted to interactive multimedia applications over satellite uplinks.
The Generalized Retransmission Announcement Protocol, GRAP, combines
immediate access by contention at low loads with reservation access
at higher loads to achieve a better channel efficiency. Simulation results
illustrate improved throughput/delay characteristics and a higher protocol
stability. Enhanced versions of the protocol are also proposed and evaluated
to further improve its efficiency, with reasonable additional complexity.

1 INTRODUCTION

Satellite systems have recently been rediscovered as a complement to terrestrial
networks in providing worldwide access to the evolving multimedia services.

Multiple access protocols constitute one of the most important aspects that
largely determine the performance of communications networks. They define
the means by which a subscriber can establish contact with the network and
hence gain access to its resources. Network subscribers then compete
in a pre-established manner to share the resources of a specific link. Many
families of protocols have been proposed in the literature in different network
contexts (Lee et al. 1983), (Lee et al. 1984), (Raychaudhuri et al. 1987), (Wong
et al. 1991) and (Mohammed et al. 1994). The features of these protocols differ
according to the type of the considered network and its topology.

Satellite networks have specific characteristics which largely influence the
features of the multiple access scheme to be adopted. On the one hand, the
inherent broadcast feature of satellite links enables the broadcast of
information to all covered users at almost the same time. On the other hand, the
large round-trip propagation delay characterizing the geostationary satellite
environment imposes some limitations on protocol design.

Two multiple access schemes are studied and analyzed in this work, the Dis- 
tributed Queueing Random Access Protocol (DQRAP), and the Announced 
Retransmission Random Access (ARRA) protocol. We have decided to com- 
pare these two as they rely on the same basic idea of allowing contention 
access at low loads and switching to reservation access at higher loads. Be- 
sides, both protocols separate new arrivals from retransmitted messages to 
avoid excessive conflict and collisions. In DQRAP (Xu et al. 1993, Koperda 
et al. 1995), this is done by blocking new arrivals from entering the system if 
a collision resolution cycle is in progress. In ARRA, on the other hand, new 
arrivals are allowed to transmit in a separate part of the frame called the 
common mini-slot pool (CMP) (Raychaudhuri 1985). 

These two protocols differ however in the collision resolution principle ap- 
plied in each case. While DQRAP adopts a tree-based approach (Capetanakis 
1979) for resolving collisions between reservation requests, ARRA protocol 
randomly arbitrates retransmission requests over the total number of frame 
slots. 

DQRAP was originally proposed for terrestrial hybrid fiber/coaxial networks.
We have therefore adapted the access algorithm to the satellite environment
before modeling and examining the performance of both protocols on
an uplink satellite channel. A new protocol, the Generalized Retransmission
Announcement Protocol (GRAP), is then proposed which further enhances the
channel efficiency and achieves a better system stability at higher loads.

In the next section, the two protocols, DQRAP and ARRA, are briefly described,
with particular emphasis on the adaptation of DQRAP to the satellite
environment. The two protocols are modeled and their simulation results
are presented. In section 3, the GRAP protocol is introduced and detailed,
explaining the modifications added to enhance its performance. Section 4
presents the protocol models developed to obtain the performance evaluation
results, which demonstrate the behavior of GRAP and E-GRAP and compare
them to the other simulated protocols.

2 PROTOCOLS IN THE SATELLITE CONTEXT 

Due to the large round-trip propagation delay encountered in satellite links,
the channel is always structured in the form of frames, whose length can
vary to cover a part of, or the entire, propagation time. In other words, the
propagation delay may correspond to one or a multiple of channel frame
durations. For both DQRAP and ARRA, the frame is divided into a number of equal
length slots, where each slot is further divided into a data transmission slot
(DS) and a number of control mini-slots (MS).

2.1 Satellite DQRAP 

The frame structure of the DQRAP protocol is shown in figure 1. The frame
duration is equal to the two-way round-trip propagation time from a user
terminal to the network control center (NCC) and back again to the terminal.
The status of the system is broadcast to all terminals within the satellite
coverage by the NCC. Each active terminal keeps track of the channel status
by listening to the feedback information of the NCC. A distributed queueing
discipline is maintained by each terminal to memorize the status of each
slot, and keep track of its own position in the queues. This discipline
functions by storing the status information of each slot in two global queues,
the transmission queue, TQ, and the collision resolution queue, RQ.

Figure 1 DQRAP frame structure for satellite links (labels: Propagation delay;
RQ1 TQ1 RQ2 TQ2 RQ3 TQ3 RQ4 TQ4)

When a terminal generates a packet, it will listen to the information
concerning the queue status of the upcoming slot after the packet arrival. If
both queues are empty (slot 1 in figure 1), the packet will be transmitted in
the DS together with a reservation request transmitted in a randomly selected
MS of the same slot.

If the DS is successful, the corresponding reservation is canceled. Otherwise,
if the DS collides while the MS is successful, the terminal enters the TQ of
the corresponding slot on the frame, and keeps track of its position in the
distributed queue to transmit its data in the same slot of the next frame,
after the round-trip propagation delay.

However, if both the DS and the MS suffer collisions (more frequent at higher
loads), the terminal is placed in the RQ, in the same position as all other
terminals colliding in the same MS of the corresponding DS. Colliding requests
are organized in the RQ in the order of the position of their MS in the slot
(collisions in MS 1 enter in the first position, while those in MS 2 enter in
the second position).

The first group in the RQ will retransmit only their request in the same slot, 
after randomly selecting another MS. Request collisions in a certain slot are 
hence resolved according to a tree algorithm by progressively dividing colliding 
groups in mini-slots into smaller groups. Collision resolution trials are then 
separated by the round-trip propagation delay (or the frame duration), as 
they are always attempted in the same slot of the frame. 

Whenever a packet arrives at a terminal and the terminal learns, from the
feedback information, that the RQ of the upcoming slot is empty while the TQ
is not (slot 2 in the figure), the terminal transmits only a contending
reservation request in one of the MSs of this slot. If, however, the TQ is
empty while the RQ is not (slot 3), or if both are non-empty (slot 4), the
packet is held in an arrivals queue in the terminal until a slot with an empty
RQ is detected on the frame.

An occupied RQ indicates that a collision resolution cycle is in progress in 
this slot, and hence new arrivals are prohibited from entering the system. This 
measure serves to decouple new arrivals transmissions from retransmissions, 
thus increasing the protocol stability. 
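
The access rule for a newly arrived packet described in this section can be summarized as a small decision function. The following Python sketch is only an illustration under stated assumptions: the names `Action` and `new_arrival_action` are hypothetical and do not come from the DQRAP specification, and the queues are reduced to two booleans for the upcoming slot.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three behaviours described in the text.
    SEND_DATA_AND_REQUEST = "data in DS plus request in a random MS"
    SEND_REQUEST_ONLY = "request in a random MS only"
    HOLD = "hold packet in the arrivals queue"


def new_arrival_action(tq_empty: bool, rq_empty: bool) -> Action:
    """Decide what a newly arrived packet does in the upcoming slot."""
    if not rq_empty:
        # A collision resolution cycle is in progress in this slot:
        # new arrivals are kept out (slots 3 and 4 in figure 1).
        return Action.HOLD
    if tq_empty:
        # Both queues empty (slot 1): contend with data and a request.
        return Action.SEND_DATA_AND_REQUEST
    # RQ empty but TQ occupied (slot 2): contend with a request only.
    return Action.SEND_REQUEST_ONLY


if __name__ == "__main__":
    for tq_empty in (True, False):
        for rq_empty in (True, False):
            print(tq_empty, rq_empty, new_arrival_action(tq_empty, rq_empty).name)
```

Testing the RQ first mirrors the text: an occupied RQ blocks new arrivals regardless of the state of the TQ.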



2.2 Satellite ARRA 

In contrast to the DQRAP protocol, which was originally proposed for ter- 
restrial networks, the ARRA protocol initially targeted broadcast channels 
with a large round-trip propagation delay. The frame structure of ARRA is 
shown in figure 2, where the round-trip delay consists of a multiple of frame 
durations. 

The frame is divided into a number K of slots, and each slot, in turn, is
divided into a data slot (DS) and K mini-slots (MS). As in the DQRAP protocol,
the length of an MS is much smaller than the length of a DS. Both the DS and
the MS have the same roles as in DQRAP, carrying the data packet and the
reservation requests respectively. An additional field of K MSs constitutes
the first part of the frame and serves as a Common Mini-slot Pool (CMP).



Figure 2 ARRA frame structure for satellite links (the frame opens with the
CMP, K mini-slots for reservation in following frames, followed by slots #1
to #K; each slot carries a data message and K mini-slots for retransmission
announcement)



When a packet is generated, the terminal waits for the beginning of the next
frame to read the feedback information transmitted by the NCC concerning the
frame. If there are free slots available for contention on this frame, it
chooses a free slot at random and transmits its packet in the DS field of the
slot. A reservation request is simultaneously transmitted in one of the MSs of
the same slot, to advertise the need for a reservation if the data message
collides. The MS within the slot is randomly chosen, and the selected MS
number indicates the slot in which retransmission will take place in the
following frame, after the round-trip propagation delay. Each terminal thus
announces its reservation in advance, in case of collision.

If no free slots are available on the frame, a terminal transmits only a
reservation request in one of the MSs of the Common Mini-slot Pool (CMP) at
the beginning of the frame, to reserve the corresponding slot number in the
upcoming frame. The feedback information regarding the status of the frame
and the slots available for contention is broadcast by the NCC before the
beginning of each frame.
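
The behaviour of an ARRA terminal holding a new packet can be sketched as follows. This is a minimal illustration, not the authors' simulator: the function name `arra_new_packet`, the tuple it returns, and the numbering of slots and mini-slots from 1 to K are assumptions made for the example.

```python
import random


def arra_new_packet(free_slots, k, rng):
    """Return (where, slot, ms) for a terminal with a new packet.

    free_slots: slot numbers (1..k) open for contention on this frame.
    where is "slot" (data plus announcement) or "cmp" (request only).
    """
    # The chosen MS number names the slot wanted in a following frame.
    ms = rng.randint(1, k)
    if free_slots:
        # Contend: data in the DS of a random free slot, request in MS `ms`.
        slot = rng.choice(sorted(free_slots))
        return ("slot", slot, ms)
    # No free slot: only a reservation request, placed in the CMP.
    return ("cmp", None, ms)


if __name__ == "__main__":
    rng = random.Random(1)
    print(arra_new_packet({2, 5}, 8, rng))
    print(arra_new_packet(set(), 8, rng))
```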

No special measures are adopted in this protocol to resolve collisions between
retransmissions, except for the fact that each back-logged terminal in a
collision will randomly choose a different slot each time by announcing in
the corresponding MS number. This randomizes the retransmission trials over
the whole available set of free slots on the frame.

2.3 Performance of DQRAP and ARRA

The two protocols were modeled in a VSAT star network configuration, where 
a satellite with on-board processing capabilities ensures the NCC access con- 
trol functions. A Poisson model was considered to model the traffic generated 
by an infinite number of VSAT user terminals. 

The DQRAP protocol was simulated with a frame duration equal to the round-trip
propagation delay (0.27 s), consisting of 108 slots, which is also the number
of interleaved DQRAP engines. For the ARRA protocol, on the other hand, a
frame duration of 20 ms was considered, with 8 slots per frame (K = 8). This
frame duration and number of slots per frame were chosen as a compromise
between the frame length and the overhead introduced by the mini-slots. DQRAP
and ARRA have the same slot length, and were tested for the same uplink
channel capacity.
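
The statement that both protocols use the same slot length can be checked directly from the figures quoted above. The short Python calculation below uses only those values; the variable names are illustrative.

```python
# Values quoted in the text.
dqrap_frame_s = 0.27   # DQRAP frame = round-trip propagation delay
dqrap_slots = 108      # slots per DQRAP frame
arra_frame_s = 0.020   # ARRA frame duration (20 ms)
arra_slots = 8         # K = 8 slots per ARRA frame

dqrap_slot_ms = dqrap_frame_s / dqrap_slots * 1000
arra_slot_ms = arra_frame_s / arra_slots * 1000
frames_per_round_trip = dqrap_frame_s / arra_frame_s

print(f"DQRAP slot: {dqrap_slot_ms:.2f} ms")
print(f"ARRA slot:  {arra_slot_ms:.2f} ms")
print(f"ARRA frames per round trip: {frames_per_round_trip:.1f}")
```

Both slot lengths come out at 2.5 ms, and the round-trip delay spans about 13.5 ARRA frames. How this is rounded to a whole number of frames in the simulation is not stated in the text.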

The DQRAP protocol was modeled for two values of the number of MSs per slot,
three and eight. The latter value was considered in order to compare its
performance to that of ARRA when both have the same number of MSs per slot.
The additional overhead introduced by the MSs is not accounted for in the
results, as we are only considering the throughput of the data slots.

As illustrated in figure 3, the throughput of ARRA saturates at a value of
0.425. This is due to the excessive number of collisions taking place between
retransmissions and newly arriving reservation requests in the limited number
of mini-slots of the CMP. As the network loading exceeds 0.43, an unstable
state is reached where collisions multiply and the system output saturates.
The throughput/delay characteristics of the ARRA protocol are shown in
figure 4.





Figure 3 ARRA throughput

Figure 4 ARRA characteristics



The performance of DQRAP with the two values of MS (3 and 8) is presented in
figure 5. The DQRAP protocol clearly outperforms the ARRA protocol in terms of
maximum channel throughput, which reaches 0.98. This is due to the complete
separation of the transmission and collision resolution processes that
characterizes DQRAP.




Figure 5 DQRAP throughput/delay characteristics

The protocol is also stable, over a wide range of loading conditions, due 
to the application of the tree-based collision resolution scheme. It is noticed 
that increasing the number of MSs per slot from 3 to 8 has a slight effect on 
reducing the average data delay at lower channel loads, but has no significance 
at higher loads, as the two curves coincide at a load of 0.9. 



3 GENERALIZED RETRANSMISSION ANNOUNCEMENT PROTOCOL

The Generalized Retransmission Announcement Protocol, GRAP, combines 
some aspects of the previously described protocols to develop an access scheme 
adapted to the satellite characteristics, and at the same time achieving an 
acceptable efficiency and stability. 

The frame structure of GRAP is shown in figure 6. As in ARRA, the frame is
divided into K slots, and each slot consists of a data slot (DS) field and a
field of K mini-slots (MS). The frame duration is chosen such that the
overhead introduced by the K MSs is very small compared to the DS length.



Figure 6 GRAP frame structure (slots #1 to #K are split into a reserved
subframe, SRS, and a contention subframe, SAS; each reserved slot carries a
data slot and K mini-slots for retransmission announcement, and each
contention slot carries a data slot and K mini-slots for new arrivals
announcement)



3.1 GRAP Access Procedure 

New arrivals When a user terminal becomes active, it monitors the down-link
stream to read the status of the upcoming frame and the set of available
slots (SAS) broadcast by the NCC. At the beginning of the following frame, if
there are free slots available for contention, the terminal transmits its
data message, accompanied by an anticipating retransmission announcement
placed in a randomly chosen MS of the same slot. The MS number in the slot
indicates the slot number that the terminal wishes to reserve in the next
frame in case of collision.

Retransmissions If an active terminal receives feedback indicating that its
packet has collided, it searches the SAS for the slot number of its announced
reservation. If the slot number is not included in the SAS, it assumes that
its reservation has been successful and waits for the assigned slot to
transmit its data message.

Conversely, if the announced slot is included in the SAS, the reservation has
not been successful, due to collisions or other errors. The terminal then
retransmits a reservation request in a randomly chosen MS of an already
reserved slot, since reserved slots are not accompanied by anticipating
announcements. In this case, the MSs associated with the set of reserved slots
(SRS) play the role of the CMP field in the ARRA protocol and replace it,
hence reducing the overhead associated with the protocol. Furthermore, the
number of MSs now available for request retransmissions is a multiple of the
size of the CMP, and this multiple is equal to the number of slots included
in the SRS.

This has the advantage that at high loads, when many slots are reserved, 
there is more space to retransmit reservation requests, and at low loads, more 
free slots will be available for contention. 
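
The check a collided terminal performs on the SAS, and the size of the mini-slot pool available for retransmitted requests, can be written down in a few lines. The sketch below assumes that slots are numbered 1 to K and that the SRS is exactly the complement of the SAS; the function names are hypothetical.

```python
def reservation_succeeded(announced_slot: int, sas: set) -> bool:
    """A collided terminal infers success if its announced slot
    is absent from the broadcast set of available slots (SAS)."""
    return announced_slot not in sas


def retransmission_minislots(k: int, sas: set) -> int:
    """Mini-slots open to retransmitted requests: K per reserved slot.

    Assumes every slot not listed in the SAS belongs to the SRS.
    """
    reserved_slots = k - len(sas)
    return k * reserved_slots


if __name__ == "__main__":
    sas = {1, 2, 3}                          # slots free for contention
    print(reservation_succeeded(5, sas))     # slot 5 is reserved
    print(reservation_succeeded(2, sas))     # slot 2 is still free
    print(retransmission_minislots(8, sas))  # pool size with 5 reserved slots
```

With K = 8 and five reserved slots, the pool holds 40 mini-slots, five times the K mini-slots of the ARRA CMP.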



3.2 Flow control 

In order to limit the risk of repeated collisions at high network loads, a
flow control mechanism relying on a back-off delay is introduced in the
terminals’ access procedure. Before retransmitting a collided request, the
terminal has to wait for a back-off delay. This delay is directly proportional
to the number of repeated collisions that the terminal has experienced.

A second control measure applies when the network load goes down and free
slots are again available on the frame: colliding requests are then allowed to
recycle in the system as new arrivals. The recycling feature is offered to
requests that have exceeded a certain number of repeated collisions, to limit
the number of conflicts in MSs and hence improve the protocol stability.

The number of repeated collisions, after which retransmissions are recycled 
as new arrivals in the system, is a parameter of the protocol that has to be 
adjusted for optimal operating efficiency. The recycling aspect not only re- 
duces the retransmission load in the MSs, but also reduces the retransmission 
delay and helps in re-partitioning the load between new arrivals and retrans- 
missions. 
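
The two flow control rules of this section can be combined in a single decision function. In the sketch below, `backoff_unit_frames` and `recycle_threshold` are hypothetical parameter names with arbitrary default values, since the text gives neither the proportionality constant nor the recycling threshold.

```python
def next_step(collisions: int,
              free_slots_available: bool,
              backoff_unit_frames: int = 1,
              recycle_threshold: int = 4):
    """Return (action, delay_in_frames) for a request that just collided.

    collisions: number of repeated collisions suffered so far.
    """
    if collisions > recycle_threshold and free_slots_available:
        # Recycle: re-enter the system as a new arrival in a free slot.
        return ("recycle", 0)
    # Otherwise back off for a delay proportional to the collision count.
    return ("retransmit", backoff_unit_frames * collisions)


if __name__ == "__main__":
    print(next_step(1, False))
    print(next_step(3, True))
    print(next_step(5, True))
    print(next_step(5, False))
```

A request beyond the threshold keeps backing off while no free slot exists, and recycles only once the load has dropped enough for free slots to reappear.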



3.3 NCC operation 

The NCC contains all the control and monitoring functions to organize and 
regulate the GRAP procedure. The uplink frame is completely received before 
any analysis and processing can be started. The NCC then treats the received 
frame and analyzes it before sending the feedback to all active terminals on 
the down-link. 

1. Frame analysis: The NCC first searches the DS fields of the frame slots 
to detect successful and collided messages. It then proceeds with the de- 
tection of collisions in the MSs. Successful reservations associated with 
successful contention messages are canceled, while those associated with 
collided messages are memorized, together with successful retransmission 
requests in the MSs of the reserved slots in the SRS. 

The NCC then allocates slots to successful reservations, and eliminates
some of those that had a virtual collision (requests placed in the same MS
number of different slots). The elimination can be based on a priority or
simply on a random choice.

The advantage of having a number of MSs equal to the number of slots on 
the frame, is the reduction of the amount of information carried by each 
MS. Hence, the NCC is not obliged to specify the terminal ID associated 
with each reservation. A terminal can just place a signal in the MS and 
wait to know whether the corresponding slot is included or not in the SAS. 

2. Feedback formulation: The set of available slots (SAS) is then formulated,
including all non-reserved slots on the frame. On the down-link, this SAS is
broadcast, together with all other feedback information concerning the
reserved slots and their position on the frame. Each terminal then receives
the result of its transmission trial after the propagation and NCC processing
delays (which can amount to one or a multiple of the up-link frame
duration).
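
The two NCC steps above can be sketched as one function that takes the outcome of a received frame and produces the reservations and the SAS for the next one. This is a simplified illustration: the input layout (a dictionary per slot with `reserved`, `ds` and `ms_ok` fields) is an assumption, and virtual collisions are resolved here by keeping the request from the lowest-numbered source slot, whereas the text allows a priority or a random choice.

```python
def analyse_frame(k, slots):
    """Analyse one received frame and build the feedback.

    slots: {slot_no: {"reserved": bool,
                      "ds": "success" | "collision" | "idle",
                      "ms_ok": set of MS numbers received without collision}}
    Returns (reserved_next, sas, dropped), where dropped maps a target
    slot to the source slots whose requests lost a virtual collision.
    """
    requests = {}  # target slot -> source slots that asked for it
    for slot_no in sorted(slots):
        s = slots[slot_no]
        # Keep requests from reserved slots (retransmissions) and from
        # contention slots whose data collided; a successful contention
        # message cancels its own anticipated reservation.
        if s["reserved"] or s["ds"] == "collision":
            for ms in sorted(s["ms_ok"]):
                requests.setdefault(ms, []).append(slot_no)
    reserved_next = set(requests)
    dropped = {t: src[1:] for t, src in requests.items() if len(src) > 1}
    sas = set(range(1, k + 1)) - reserved_next
    return reserved_next, sas, dropped


if __name__ == "__main__":
    frame = {
        1: {"reserved": False, "ds": "success", "ms_ok": {3}},
        2: {"reserved": False, "ds": "collision", "ms_ok": {2, 4}},
        3: {"reserved": True, "ds": "success", "ms_ok": {2}},
        4: {"reserved": False, "ds": "idle", "ms_ok": set()},
    }
    print(analyse_frame(4, frame))
```

In this example the requests for slot 2 coming from slots 2 and 3 form a virtual collision, so only one of them keeps the reservation.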



3.4 Enhanced GRAP 

In order to further enhance the performance of the protocol, a queueing system 
is added in the NCC to store virtually colliding reservation requests. The 
expression virtual collision designates the conflict occurring between terminals 
placing their reservation requests in the same MS number of two or more 
different frame slots. Although these requests can be correctly received at the 
NCC, their intention to reserve the same slot on the next frame causes the 
NCC to hold the reservation for only one terminal and cancel the others. 

The queueing system overcomes this weakness by preserving the reservations
for all virtually colliding requests. Two queueing disciplines can be
envisaged. The first one comprises a single distributed queue for the whole
frame, while the second comprises a number K of queues, one for each slot on
the frame.



(a) Single reservation queue 

In this scenario, only one reservation queue is maintained by the NCC to store 
virtually collided requests, for which no place can be found on the upcoming 
frame, and postpone their allocations for future frames. This can be modeled 
as shown in figure 7. 

The single queue follows a priority discipline, with the higher priority given 
to requests that have suffered more collisions and hence, more delay. At the 
NCC, these requests are scheduled at the beginning of the frame and are 
completely served before new arriving requests. If there remains some place 
afterwards on the frame, new requests can be accommodated. 

Since the remaining free slots on the frame may not correspond to the
originally reserved ones for some waiting requests, the NCC must include the
reserved slot number in its feedback information for each terminal, to notify
it of its new reserved slot.

Figure 7 Single queue E-GRAP model

The advantage of this approach is the overall reduction of the mean mes- 
sage delay at higher loads, since the priority is proportional to the number 
of repeated collisions. A tradeoff has then to be made between the encoun- 
tered mean delay and the maximum channel efficiency, since more feedback 
information has to be sent on the down-link. 

(b) Multiple reservation queues 

The second scenario envisaged to avoid virtual collisions is the introduction
of a number of waiting queues to store successful reservation requests. This
number is equal to the number of slots per frame, K, and hence there is one
waiting queue per slot, as illustrated in figure 8.




Figure 8 Multiple queues E-GRAP model

When several terminals transmit their reservation requests in the same
mini-slot number of different slots, all the successful requests are retained
and stored in the queue corresponding to the targeted slot. Their position in
the queue may be random, follow their position on the frame, or be decided
according to a certain priority discipline.

The extra feedback information concerning the reserved slot number is not 
needed in this approach. This is however, at the expense of the possibility of 
suffering more delay, waiting for the same slot in future frames, while there 
may be other free slots on the present frame. 

The two approaches explained above help in reducing the number of re- 
transmissions and hence, possible future collisions. Therefore they confine the 
delay variation to a smaller value than that experienced by the simple GRAP. 
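
The difference between the two queueing disciplines can be illustrated with a toy scheduler for one frame. The sketch below is a simplification under stated assumptions: the function names and request tuples are hypothetical, the single queue is served purely by collision count, and the multiple queues are served first-in first-out.

```python
import heapq
from collections import deque


def single_queue_schedule(requests, k):
    """requests: list of (terminal, collisions). One queue for the frame;
    requests with more collisions are served first, in any slot."""
    heap = [(-c, i, t) for i, (t, c) in enumerate(requests)]
    heapq.heapify(heap)
    allocation = {}
    for slot in range(1, k + 1):
        if not heap:
            break
        _, _, terminal = heapq.heappop(heap)
        allocation[slot] = terminal
    waiting = [t for _, _, t in sorted(heap)]
    return allocation, waiting


def multi_queue_schedule(requests, k):
    """requests: list of (terminal, target_slot). One FIFO queue per slot;
    a request can only be served in the slot it announced."""
    queues = {s: deque() for s in range(1, k + 1)}
    for terminal, slot in requests:
        queues[slot].append(terminal)
    allocation = {s: q.popleft() for s, q in queues.items() if q}
    waiting = {s: list(q) for s, q in queues.items() if q}
    return allocation, waiting


if __name__ == "__main__":
    print(single_queue_schedule([("a", 0), ("b", 2), ("c", 1)], 2))
    print(multi_queue_schedule([("a", 3), ("b", 3), ("c", 1)], 4))
```

In the second call, terminal b waits for slot 3 of a later frame although slots 2 and 4 are free, which is the extra delay the text attributes to the multiple-queue approach.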



4 GRAP AND E-GRAP PERFORMANCE 




Figure 9 VSAT terminal model 

Simulation models have been developed for GRAP and E-GRAP to exam- 
ine their performance in the same VSAT network configuration, previously 
described to test DQRAP and ARRA. 

Simplified block diagrams for the VSAT terminal model and the NCC model 
are shown in figures 9 and 10. A frame duration of 20 ms and K = 8 slots per 
frame have been maintained for the system. 

Figure 11 illustrates the throughput/delay characteristics of GRAP and of the
two versions of E-GRAP, with a single queue and with multiple queues. GRAP
achieves a channel throughput of 0.95 for a mean data delay that approaches
2 seconds. It is clear that introducing the queueing discipline has improved
the performance of GRAP by increasing the maximum channel throughput to 0.98
while reducing the mean data delay in the system. E-GRAP with a single queue
further reduces the mean delay at high loads, compared to the multiple-queue
E-GRAP, due to the complete statistical multiplexing on which it is based. In
this version of the protocol, the number of MSs per slot can therefore differ
from the number of slots per frame, since a terminal can be assigned any slot
on the frame.

Figure 10 NCC model

Figure 11 GRAP and E-GRAP characteristics (horizontal axis: normalized
throughput)




Figure 12 Performance comparison between DQRAP and E-GRAP (horizontal axis:
normalized throughput)

Figure 13 GRAP delay distribution (probability density function)

The additional delay introduced by the multiple-queue E-GRAP at higher loads
is due to the fact that each terminal must wait for the specific slot that
corresponds to the number of the MS in which it previously placed its
reservation. The advantage of this technique, however, is that the overhead
introduced by the MSs in the slot is reduced, since less control and feedback
information is transmitted.

Compared to the DQRAP protocol, as indicated in figure 12, both GRAP 
and E-GRAP outperform DQRAP, in terms of mean delay, up to a load of 
0.95. As the load exceeds the latter value, the improvement introduced by 
E-GRAP is significant where it attains a throughput value of 0.98 for a delay 
of only 1.4 seconds. This demonstrates the E-GRAP stability at very high 
loads. 

The delay distribution of GRAP, E-GRAP with single queue and E-GRAP 
with multiple queues, is measured and illustrated in figures 13, 14, and 15 
respectively for a 0.8 loading factor. The E-GRAP is shown to achieve a better 
delay distribution, where the delay density function is confined to 1.3 seconds 
at 0.8 load, compared to around 3 seconds for the GRAP. 

The distribution of the mean delay encountered by data packets in the 
DQRAP protocol, at the same load (0.8), is given in figures 16 and 17 for 
3 and 8 MS respectively. The distributions for both MS values have little 
difference between them, but they are more dispersed compared to those of 
GRAP and E-GRAP. The dispersion in the delay values is spread over a wider 
range reaching a value of 5 seconds. This is due to the tree algorithm applied 
to resolve collisions of reservation requests in DQRAP. 

The results hence indicate that protocols applying tree-based collision
resolution algorithms are stable at high loads, but at the expense of
additional delay and delay variation. Besides, the DQRAP protocol introduces
more complex functions in the user terminals, since each terminal has to keep
track of the state of both the transmission queue TQ and the collision
resolution queue RQ for each slot on the frame.

A back-off delay applied in the GRAP protocol improves the delay performance
over that of DQRAP, with much less complexity. The enhanced versions of the
protocol, with a queueing discipline, further improve the protocol
performance, at the expense of managing one or multiple distributed
transmission queues.

5 CONCLUSIONS

We have modeled and analyzed two packet multiple access protocols in the
satellite environment, DQRAP and ARRA. The first is based on a tree collision
resolution algorithm, which proceeds in parallel with the transmission of
packets, using two separate distributed queues for each slot. The second
simulated protocol relies on a simple retransmission policy to transmit
reservation requests.

A new protocol, the GRAP, was proposed as a compromise between both 
techniques. It relies on the principle of complete separation between new ar- 
rivals and retransmissions. New arrivals access the contention sub-frame, while 
the retransmissions contend in the mini-slots of the reservation sub-frame. The 
protocol, however, preserves the right for retransmissions to recycle as new 
arrivals when a certain number of collisions is encountered. 

The frame structure of the ARRA has then been adopted for the proposed 
GRAP, after removing the CMP field, and the distributed queueing policy of 
the E-GRAP was inspired by the DQRAP protocol. GRAP and E-GRAP are 
hence less complex than DQRAP as the whole system can keep track of just 
one queue (the reservation queue), while they exhibit a more complex feature 
due to the recycling aspect. This can be easily handled by VSAT terminals. 

GRAP was found to outperform both DQRAP and ARRA, while the enhanced versions
achieve better performance and stability at very high loads. E-GRAP exhibits
the lowest delay (and hence delay variation), which makes it better adapted to
data traffic with real-time requirements and to multimedia applications, such
as interactive on-demand services over the Internet.

Abstract

A Hybrid Fiber Coaxial (HFC) plant is typically configured in a tree topology and 
covers a large area with tens of thousands of House Holds Passed (HHP) and 
several return paths into the headend. During initial deployment, it is usually the 
case that the number of cable modem subscribers is small compared to the HHP, 
resulting in a small number of modems spread out over the return paths. To 
operate more efficiently, the return paths should be combined to reduce the port 
requirement at the headend. 

A simple upstream RF combiner can be used to merge separate return paths, but 
that would also funnel the noise of the separate paths and degrade performance. 
Instead, what is needed is an upstream aggregation device that will multiplex the 
paths without aggregating the noise. To do so, such a device needs to operate with 
the HFC’s MAC (media access control) layer and employ multiplexing discipline 
that incurs limited impact on performance under varying modem distribution 
scenarios. In addition, this device must be simple and transparent to the rest of the 
HFC system. 

In this paper, we describe through analysis and simulation how such a return path
multiplexing device is possible, and how it impacts the HFC network architecture
and upstream performance.

1 INTRODUCTION

Depending on the topology of the HFC plant, the distribution of users, and the
node size, the number of return paths into the headend of an HFC plant may vary.
For example, an HFC plant with 64 000 House Holds Passed (HHP) will result in 32
separate returns, assuming a node size of 2000. Further recombination of these
returns in each node using Fabry-Perot or Distributed Feedback lasers may reduce
the return port requirement by a factor of 4, yielding a realistic upstream port
requirement of 8. An HFC plant with these components is illustrated in Figure 1.




The number of active subscribers on each trunk is likely to be small in initial 
deployment, perhaps only 1%. Such typical sparse deployment means each return 
path only services a few modems. Nevertheless, each return path normally requires 
a port in the headend. Since the cost of headend equipment increases with the 
number of ports, this results in an inefficient utilization of headend resources. 

To increase efficiency, we need to concentrate the modems into fewer ports. To 
do so, we can either redesign the HFC plant topology or we can recombine the 
return trunks. The first solution is too costly and will backfire when more cable 
modem subscribers join. A simple upstream RF combiner can be used to combine 
the return paths, but that would also raise the noise floor at the headend ports due 
to a phenomenon called noise funneling. Noise funneling can be catastrophic to the 
performance of the HFC access network. 

Instead, what is needed is an upstream aggregation device that will recombine 
the paths without aggregating the noise. To do so, such a device needs to operate 
with the HFC’s MAC (media access control) layer and employ a multiplexing 
policy that incurs limited impact on performance under varying modem 
distribution scenarios. The complexity of this device must be low to realize the 
cost savings. In addition, like a passive RF combiner, this device must be 
transparent to the rest of the system. In this paper, we call such a device the Return
Multiplexer (RMUX), and illustrate in Figure 2 how it fits into an HFC access
network.




As we will describe in the next section, the RMUX introduces a new architecture 
for HFC systems. There are of course certain trade-offs and limitations in its use, 
but properly configured, the RMUX can dramatically reduce the headend port 
requirement with minimal performance degradation. In the remainder of this 
paper, we examine through simple analysis and detailed simulation how the 
RMUX performs under various configurations and load conditions. 

2 SYSTEM DESCRIPTION 

As described in the UPSTREAMS HFC MAC/PHY protocol specification
(Laubach, 1997), upstream data in many deployed HFC access networks is
transported in fixed-size TDMA slots carrying ATM cell payloads, and the
allocation of the slots is centrally scheduled by the headend. The RMUX, as a
cell-level multiplexer, can therefore be controlled by the headend to open and
close the corresponding path during the appropriate slots. This can be done
without any loss of data cells. This is illustrated in Figure 3, where the RMUX
will serve the incoming data cells in the port sequence {1, 2, 4, 3}. In this
way, the RMUX isolates each return path from the RF noise of other paths.
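
The switched forwarding of scheduled data cells can be pictured with a few lines of Python. This is a minimal sketch under stated assumptions: the function `forward` and its data layout are hypothetical, and the grant list stands for the per-slot port schedule that the headend is assumed to communicate to the RMUX.

```python
def forward(grants, arrivals):
    """Forward one cell per slot from the port opened for that slot.

    grants: list of port numbers, one per upstream slot (headend schedule).
    arrivals: {slot_index: {port: cell}} cells present at the RMUX inputs.
    Returns the cells passed to the headend (None for an empty slot).
    """
    output = []
    for slot, open_port in enumerate(grants):
        # Only the scheduled port is connected; other inputs (and their
        # noise) are isolated during this slot.
        output.append(arrivals.get(slot, {}).get(open_port))
    return output


if __name__ == "__main__":
    grants = [1, 2, 4, 3]  # port sequence used in the Figure 3 example
    arrivals = {0: {1: "c1"}, 1: {2: "c2"}, 2: {4: "c4"}, 3: {3: "c3"}}
    print(forward(grants, arrivals))
```

Because data slots are granted centrally, the open port always matches the port on which the cell arrives, so no data cell is lost.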

A typical HFC protocol employs a request-grant mechanism for modems to
acquire upstream data slots during preallocated phases of the protocol’s operation.
During the request phase, a fixed number of slots are made available for modems
to send requests upstream. Such requests are sent according to a multiple access
protocol with a collision resolution mechanism similar to that of 802.3
Ethernet. However, the headend has no a priori knowledge of when or on which
return path the next request will arrive. Consequently, such a cell may arrive
at a port not currently listened to by the RMUX. A request cell that is not
serviced by the RMUX is lost, and is referred to as a blocked cell.

Figure 3: Switched forwarding of data in a 4x1 RMUX

From the viewpoint of the transmitting cable modem, there is no difference 
between a blocked request cell and a request cell lost to collision. Therefore, the 
RMUX can be viewed as an additional contention mechanism during the request 
phase. To optimize the performance of the RMUX is to minimize the number of 
blocked cells, and to do so in a fair manner across the input ports of the RMUX. 

The manner in which cells are blocked at the RMUX depends on two main 
factors. One is the service discipline used in servicing the input ports of the RMUX 
during the request phase. The other is the load distribution across the input ports of 
the RMUX. We next discuss these two factors. 



2.1 RMUX Service Discipline 

The RMUX service discipline dictates which input port to service and at which 
time. The simplest form of service discipline is the round-robin, and there are 
several variations. 



• Round-Robin: by serving each input port in a cyclic manner and serving a
constant number of arrivals during each visit, round-robin is a fair policy.

• Grouped Round-Robin: if certain ports can be identified as lightly loaded, or
have a high SNR and can withstand some noise aggregation, groups of input
ports can be serviced at the same time, thereby reducing the time between
visits. Again, a fixed number of slots is served during each visit.

Aside from round-robin, we can also vary the order of service of the input ports.

• Load Dependent Acyclic: the load on each input port can be used to determine
how many consecutive slots are serviced each time and in what order the
inputs are serviced.

However, such acyclic policies are more complex to implement. We propose some 
in this paper that should be relatively easy to implement, but no performance 
results are available for them at this time. 
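
Blocking under the plain round-robin discipline can be illustrated with a small counting function. The sketch below is not the simulation model of this paper: the function names are hypothetical, ports are assumed to be numbered from 1 to K, request slots are indexed from 0, and collisions between requests are ignored so that only RMUX blocking is counted.

```python
def round_robin_port(slot: int, k: int, slots_per_visit: int = 1) -> int:
    """Port serviced by a K-input RMUX during a given request slot."""
    return (slot // slots_per_visit) % k + 1


def count_blocked(requests, k: int, slots_per_visit: int = 1) -> int:
    """requests: list of (slot, port) request cells sent by modems.
    A cell is blocked when its port is not the one being serviced."""
    return sum(
        1
        for slot, port in requests
        if round_robin_port(slot, k, slots_per_visit) != port
    )


if __name__ == "__main__":
    requests = [(0, 1), (1, 1), (2, 3), (5, 2)]
    print(count_blocked(requests, k=4))
    print(count_blocked(requests, k=4, slots_per_visit=2))
```

Changing `slots_per_visit` changes which cells are blocked, which shows how the same arrival pattern interacts differently with different visit lengths.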

2.2 RMUX Input Load Distribution 

Assume the RMUX employs a round-robin policy. With the total number of
modems fixed, the worst performance should occur when all upstream traffic is
grouped onto one input port. Reasoning similarly, the best performance will occur
under an evenly distributed load across the input ports.

The probability of arrival at the ports can be used to quantify the load 
distribution across the input ports. However, such characterization is not very 
useful to the HFC access network architect, since such fine grained information is 
typically not available. A more practical way to quantify the load distribution is to 
use the number of active modems on each port. If the HFC system used is capable 
of providing Quality of Service (QoS) levels delineated by minimum and 
maximum rates, like that of Com21’s system, then the subscribed minimum rates 
together with the number of modems can together be used to quantify the load on 
each port. For simplicity, we assume the QoS of all modems, if applicable, to be 
the same. 

Let the number of input ports be K. We define the Index of Symmetry
(IOS), illustrated in Figure 4, to quantify the symmetry of the input port
loading.



[Figure: a K x 1 RMUX. One input port carries $n_{\max}$ cable modems and each of
the other (K-1) input ports carries $n_{\min}$, so that
N = total number of cable modems = $n_{\max} + (K-1)\,n_{\min}$, and
IOS (Index of Symmetry) = $n_{\min}/n_{\max}$.]

Figure 4: Illustration of the Index of Symmetry

Table 1: Example Index of Symmetry calculations

K    N     n_min   n_max   IOS
8    128   16      16      1.0
8    128   0       128     0.0
8    128   14      30      0.467
4    64    14      22      0.636




Using this definition: 

$$\mathrm{IOS} = \frac{n_{\min}}{n_{\max}} \qquad (1)$$

For the type of load distribution shown in Figure 4, we have

$$n_{\max} = \frac{N}{(K-1)\,\mathrm{IOS} + 1} \qquad (2)$$

which can be conveniently used in network planning. 
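As a quick check of the two expressions above, the following sketch recomputes the IOS for the example rows of Table 1 and confirms that the planning formula recovers the largest per-port modem count from N, K and the IOS. The function names are chosen here for illustration only.

```python
def index_of_symmetry(n_min, n_max):
    """Equation (1): IOS = n_min / n_max."""
    return n_min / n_max

def n_max_from_ios(total_modems, num_ports, ios):
    """Equation (2): n_max = N / [(K - 1) * IOS + 1]."""
    return total_modems / ((num_ports - 1) * ios + 1)

# Rows of Table 1 as (K, N, n_min, n_max).
rows = [(8, 128, 16, 16), (8, 128, 0, 128), (8, 128, 14, 30), (4, 64, 14, 22)]
for k, n, n_min, n_max in rows:
    ios = index_of_symmetry(n_min, n_max)
    # Equation (2) should give back n_max for every row.
    assert abs(n_max_from_ios(n, k, ios) - n_max) < 1e-9
    print(k, n, n_min, n_max, round(ios, 3))
```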

An IOS can be similarly defined for a system with bimodal load. That is, if the
ports are partitioned into two groups, with group 1 having $n_{\max}$ cable modems and
group 2 having $n_{\min}$ cable modems, where $n_{\max} > n_{\min}$, we can define the IOS to be
$n_{\min}/n_{\max}$. Through simulation, we have found that regardless of how nodes are
distributed, the IOS definition in (1) allows us to establish at least approximate
regions of operation that give desirable performance.

2.3 Implications 

The above discussion tells us that there are many design and configuration 
alternatives associated with the RMUX. There are many extensions and variations 
of service disciplines based on the ones proposed. For example, a load dependent 
round robin policy can vary the number of slots served depending on the number 
of modems active on each port. An optimal service discipline that delivers zero
blocking under all load distributions may exist, but it would require accurate
estimates of the traffic pattern of every modem and close coordination with the
headend scheduler. The complexity of the resulting RMUX would likely be
prohibitive. We demonstrate in the following sections that a simple RMUX can
deliver good performance.

Even when the choice of service discipline and load distribution is not optimized
to ensure no blocking, we must keep in mind the alternatives facing the HFC
access network architect. If an RF combiner is used to aggregate 8 return paths, the
signal to noise ratio (SNR) may decrease by as much as 9 dB. Even with the SNR
of each return path at a favorable 20 dB or better, the impact on system
performance is significant.
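The 9 dB figure is consistent with a simple noise-addition estimate. As a sketch, assume (an assumption made here, not stated in the text) that each of the K combined return paths contributes equal, uncorrelated noise power while the wanted signal arrives on only one path at a time:

```latex
\[
  \Delta\mathrm{SNR} = 10\log_{10} K = 10\log_{10} 8 \approx 9.03\ \text{dB}
\]
```

Under that assumption, a return path with a 20 dB SNR would be left with roughly 11 dB after passive combining.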



3 ANALYSIS 

The RMUX can be modeled as a simple polling system where arrivals and service
are slotted with slot duration D. Unlike a typical polling system, however,
the buffer size is zero and the switching time depends on the service time. We
denote the nth time the server visits buffer i by $t_i(n)$, and the number of arrivals
during this nth visit by $a_i(n)$. The maximum number of slots serviced during that
visit is denoted by $s_i(n)$, and the switching time from buffer i to the next buffer
after $t_i(n)$ is denoted by $w_i(n)$. The value of $w_i(n)$ is given by

$$w_i(n) = D\,\bigl(s_i(n) - a_i(n)\bigr). \qquad (3)$$

Figure 5 shows such a polling system with K buffers. The load on port i is
represented by the arrival process $x_i$ into buffer i. Each arrival process $x_i$ describes
the slotted arrival of requests into port i. The process $x_i$ needs to capture the
behavior of the aggregated traffic generated by the modems connected to port i, as
modulated by the Ethernet-like contention resolution algorithm.




Figure 5: Polling system model for the RMUX

There are existing analytical results (Takagi, 1985) which can be extended to 
describe the behavior for this polling system, but under unrealistic stochastic 
assumptions. For example, one possible metric for evaluating the performance of 
the RMUX is the probability that an arrival to buffer i will occur during the 
interval $[t_i(n),\ t_i(n) + D\,s_i(n)]$, for some n. But such a metric requires a valid traffic
model describing the arrival process, which is difficult to obtain.

For now, we present only a first-order approximate analysis of the proposed
service disciplines to illustrate their general behavior. The
more detailed evaluation of the system is left to simulation.

3.1 Round-Robin Service 

We model an RMUX with round-robin service using a polling system with a cyclic
server, where after buffer i is visited, the server moves on to buffer (i+1) mod K.
We also require that $s_i(n) = 1$ for all i, n.

This is a system with symmetric load, so the probability that an arriving
customer will not be blocked is 1/K. Fortunately, this does not translate
automatically into a scaling of throughput by 1/K. Let the normalized occupation
of upstream slots, or load, of a system without the RMUX be $\lambda$, where $\lambda = 1$
implies that every upstream slot is occupied. Assuming traffic is divided evenly
among the K ports, the load on each port i, denoted by $L_i$, is $\lambda/K$. However,
this does not mean that only $\lambda/K$ of the slots on each port will be occupied. With less
contention on a path with fewer modems, the ratio of slots occupied is scaled up
by a factor $\alpha$, $1 \le \alpha \le K/\lambda$, as illustrated in the shaded region of Figure 6 (a). So
scaled, the portion of slots occupied per port is $P = \alpha\lambda/K$.




Figure 6: illustrate effects of traffic divided into K ports 

The exact value of $\alpha$ depends on many details of the binary exponential back-off
algorithm used for contention resolution, but we can interpret the
boundary cases intuitively. In a highly loaded system with large $\lambda$, modems
that previously failed to acquire request slots use the newly
available request cell slots in the now lightly loaded return path, which results in
$\alpha = K/\lambda$ and $P = \alpha\lambda/K = 1$. Only in a very lightly loaded system do we have $\alpha = 1$,
where no additional request cells are generated. In this case, with $\lambda = \epsilon \approx 0$, we can
write $P = \alpha\lambda/K \approx \epsilon \approx \lambda$. The resulting P is plotted in Figure 6 (b), where we see
qualitatively how the portion of slots occupied is increased from $\lambda/K$ as a function
of $\lambda$.

The round-robin RMUX scales the throughput of each port by allowing only
1/K of the slots through, which gives a per-port throughput of $\alpha\lambda/K^2$. Combining the K
paths, the aggregate throughput is then $T = \alpha\lambda/K$. Using the value of $\alpha$ reasoned
above, we can plot the approximate throughput T as a function of $\lambda$ in Figure 7 (a).
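The boundary cases argued above can be turned into a small calculation. The sketch below takes the aggregate throughput to be the scaling factor times the load divided by K, and evaluates it at the two extremes of the scaling factor (1, and K divided by the load). The function name and the restriction of the load to (0, 1] are assumptions made for this illustration.

```python
def round_robin_bounds(load, num_ports):
    """Return (low, high) bounds on aggregate throughput T = alpha * load / K.

    alpha = 1 (no extra request cells) gives the lower bound load / K;
    alpha = K / load (every slot on every port occupied) gives the upper bound.
    """
    if not 0 < load <= 1:
        raise ValueError("load must be in (0, 1]")
    alpha_low, alpha_high = 1.0, num_ports / load
    return alpha_low * load / num_ports, alpha_high * load / num_ports

for load in (0.1, 0.5, 1.0):
    low, high = round_robin_bounds(load, num_ports=8)
    print(f"load={load:.1f}  T between {low:.4f} and {high:.4f}")
```

Where the actual throughput falls between these bounds depends on the contention resolution details, which this sketch does not model.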




Figure 7: illustrate approximate analysis of RMUX throughput 

3.2 Grouped Round-Robin Service 

We model an RMUX with grouped round-robin service using a polling system
with a cyclic server, a reduced number of buffers, and aggregated arrival
processes. Let J be the number of aggregated arrival processes, where J < K. As
before, we have $s_i(n) = 1$ for all i, n.

For simplicity, we examine the case where J = K/2, where adjacent ports are
grouped into pairs. With the number of input ports halved, the probability
that an arriving request cell will not be blocked is now 2/K. Following a similar
line of reasoning as above, we have $1 \le \alpha \le K/(2\lambda)$, and the throughput T is
bounded below by $\lambda/J$. Since the increased aggregated load on each port will push
$\alpha$ towards $K/(2\lambda)$, we argue that the throughput T will be approximately as shown
by the top bold line in Figure 7 (b).

We can therefore expect that grouping of ports, when allowed under noise and 
load conditions, will result in better performance. 

3.3 Load Dependent Acyclic Service 

We model an RMUX with a load dependent acyclic server using a deterministic
fluid approximation polling system model with a rate dependent service discipline.
Instead of fixing $s_i(n)$ based on a fixed loading of the port, the service discipline
now adjusts $s_i(n)$ dynamically as a function of time and the load on each port.

Extending existing scheduling results (Clear-A-Fraction policy from (Perkins)),
we propose the following policy where $s_i(n)$ is based on estimated per port load.

The policy we propose is called the Highest-Load-First (HLF) policy. Let $t_m$ be
the time at which the server visits the mth buffer. Let $e_i(m)$ be the estimated
number of arrivals into buffer i during $[t_m,\ t_m + s_i(\max)]$, where $s_i(\max)$ is a
predetermined maximum number of slots each port can be allocated at any time.
Such a value can be easily selected based on the maximum latency of an HFC system,
as well as on minimum allocations to each port. With HLF, after the server
has finished serving a buffer, the next buffer to be visited is buffer j, where
$j = \arg\max_i \{e_i(n)\}$. The number of slots serviced at port i is then set to
$s_i(n) = \min\{e_i(n),\ s_i(\max)\}$.

Earlier work (Perkins) has shown that such a policy is complete, that is, a set of
$s_i(\max)$, i = 1,..., K, can be selected to ensure that every buffer will be visited in a
given time period. A more generalized version of the HLF is the Fractional-Load
(FL) policy, where the next buffer i is selected based on whether $e_i(n) > \epsilon \sum_j e_j(n)$,
where $0 < \epsilon < 1$, is satisfied. Assuming the estimated load equals the actual load,
these scheduling policies have been shown to be stable under different initial
conditions. The key potential advantage of these policies is that they are relatively
easy to implement and have parameters that can be tuned to optimize performance.
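The selection rules described above are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: ties in HLF are broken in favor of the lowest port index, and the Fractional-Load test is read as comparing a port's estimate against a fraction of the total estimated load across all ports.

```python
def highest_load_first(estimates, max_slots):
    """Pick the port with the largest estimated load and size its visit.

    estimates[i] is the estimated number of arrivals at port i;
    max_slots[i] is the predetermined cap on slots for port i.
    Returns (port, slots) with slots = min(estimate, cap).
    """
    port = max(range(len(estimates)), key=lambda i: estimates[i])
    return port, min(estimates[port], max_slots[port])

def fractional_load(estimates, fraction):
    """Return the ports whose estimate exceeds `fraction` of the total estimate."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in (0, 1)")
    threshold = fraction * sum(estimates)
    return [i for i, e in enumerate(estimates) if e > threshold]

estimates = [3, 9, 0, 5]   # hypothetical per-port load estimates
caps = [4, 4, 4, 4]        # hypothetical per-port slot caps
print(highest_load_first(estimates, caps))   # port 1 is chosen, capped at 4 slots
print(fractional_load(estimates, 0.25))      # ports above a quarter of the total
```

Both rules need only the per-port estimates and one or two tunable parameters, which is the implementation simplicity the text points to.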

4 SIMULATION MODEL 

The analytical models above do not describe the behavior of the system 
quantitatively, nor do they capture the performance of the system under realistic 
traffic. By realistic traffic, we mean the traffic generated by the typical cable 
modem user. That traffic is TCP/IP based, and is mainly generated by World Wide
Web (WWW) and FTP client-server applications. Such traffic introduces yet another
layer of flow control and is also influenced by human behavior not easily captured
by analysis. For these reasons, we rely mostly on simulation.

The simulator we use is a customized version of the ns simulator (Ns, 1998), a 
timed discrete event simulator. The simulator contains modules for simulating the 
UPSTREAMS HFC MAC protocol (Laubach, 1997), full TCP protocol behavior, 
and WWW browser client-server interactions as described in (Nichols, 1997). Our 
WWW client-server model, including file size distributions, is based on previous 
trace data (Cunha, 1995). 

Using this version of ns, we simulated the system using the following
parameters: 

• Number of users/modems: Several user population sizes (ranging from 64 to
2000) are used.

• HFC network: For simplicity, we assume a single upstream receive port at the
headend and a single downstream channel. Depending on the RMUX 
configuration, four or eight upstream return paths feed into the single RMUX. 

• Quality of service: the HFC system simulated allows maximum and minimum 
rates to be specified for both up and downstream traffic. 

• Servers: For simplicity, we assume there is no limit to the number of 
concurrent WWW sessions using the HTTP protocol at the servers. 

In Figure 8, we illustrate the major components in our simulation model. It is
important to note that this is a detailed model that takes into account not only
the behavior of the TCP protocol but also the interactions between the ATM
protocol (e.g., Segmentation and Reassembly - SAR agent), the UPSTREAMS protocol
(Laubach, 1997), TCP, and higher layer applications (e.g., WWW/FTP server and
clients).

Figure 8: Components in the simulation model



During each WWW session between a client and the headend server, a sequence 
of short and long packets containing TCP SYN and SYN+ACK, URL, and 
document pages is exchanged. As described in more detail in (Nichols, 1997), each 
such WWW browser exchange is currently configured with the following 
parameters: 

• TCP window size of 6x1460 = 8760 bytes 

• Packet size of 64 bytes. 

• URL size = 3 cells 

• Page size distributed according to a Pareto distribution up to 1.5MB

• Each page contains 4 in-line documents (e.g., images) whose size is distributed
according to a Pareto distribution up to 0.25MB

• Initial delay of 0.

The resulting average page size is around 10KB, while the in-line document size is
around 5KB. 
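A capped Pareto size distribution of the kind listed above can be sampled as follows. This is a sketch only: the text gives the upper limits (1.5MB and 0.25MB) but not the shape or scale parameters, so the values of `shape` and `scale` used here are hypothetical, and redrawing samples that exceed the cap is one of several possible truncation methods.

```python
import random

def truncated_pareto(shape, scale, upper, rng=random):
    """Draw a Pareto(shape, scale) sample, redrawing until it is <= upper."""
    while True:
        # paretovariate returns values >= 1, so scale sets the minimum size.
        value = scale * rng.paretovariate(shape)
        if value <= upper:
            return value

rng = random.Random(1)  # fixed seed so the sketch is repeatable
# Hypothetical shape and scale; only the 1.5MB cap comes from the text.
page_sizes = [truncated_pareto(1.2, 2_000, 1_500_000, rng) for _ in range(5)]
print([round(size) for size in page_sizes])
```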

The duration of a typical simulation run is between 150 and 600 seconds in 
simulated time. This translates into real time depending on the number of events 
generated, which is proportional to the number of cable modems, load, and 
collision frequency. Using an Ultra Sparc workstation, a moderately loaded 
simulation run with 512 cable modems and 300 simulated seconds takes around 3 
hours to complete, while a simulation run with 16 cable modems takes about 15 
minutes. To eliminate the effects of transient behavior, the initial 30 seconds of
every simulation run are excluded from the final tallied results.

5 SIMULATION RESULTS 

Using the simulated WWW traffic above, we simulated the performance of a HFC 
system with a RMUX. 

Performance metrics 

The performance metrics used are:

• Latency: median and 90th percentile end-to-end delay for each modem.

• Throughput: average throughput for each modem. 

Statistics about cell level, packet level, and page level performance are recorded, 
though only cell and packet level statistics are presented here. 

5.1 Effects of Load Distribution 

In Figure 9, we plot the upstream latency for varying
numbers of cable modems. The number of modems was varied from 64 to 2,000.
Independent of the number of modems, the index of symmetry (IOS) as
defined in equation (1) was clearly the dominant factor in the latency performance.

[Plot: normalized average upstream delay vs. IOS; reference marker "x": delay with no RMUX]

Figure 9: Upstream latency as a function of Index of Symmetry

From Figure 9, we see that increased IOS brings about dramatic improvement in
the delay behavior. The curves also flatten out at around 2 for IOS values
greater than approximately 0.4. Using equation (2), this threshold of 0.4 implies
that one should place no more than about 1/4 (N/3.8, to be exact) of the cable
modems on one input port in an 8x1 configured RMUX, or no more than N/2.2
cable modems on one input port of a 4x1 RMUX.
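The two limits follow from substituting IOS = 0.4 into the Section 2.2 expression for the largest per-port modem count, $n_{\max} = N/[(K-1)\,\mathrm{IOS} + 1]$:

```latex
\[
  K = 8:\quad n_{\max} = \frac{N}{(8-1)(0.4) + 1} = \frac{N}{3.8} \approx 0.263\,N
\]
\[
  K = 4:\quad n_{\max} = \frac{N}{(4-1)(0.4) + 1} = \frac{N}{2.2} \approx 0.455\,N
\]
```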

What this implies for throughput is a more complicated question to answer. This 
is because how increased upstream latency manifests itself in TCP throughput 
degradation depends on many factors. In (Cohen, 1998), the effects of upstream
delay and error on TCP throughput were studied in detail. Although the study did
not take into account the behavior of the HFC MAC, it does illustrate the
sensitivity of TCP throughput to the HFC upstream integrity.



[Plot: normalized average upstream rate vs. IOS]

Figure 10: Upstream throughput as a function of Index of Symmetry

In Figure 10, we see that there is a knee in the curves at an IOS value of 0.2 -
0.4, and the throughput of highly loaded systems is much closer to the throughput
of a system without the RMUX. This was predicted by the simple analysis
illustrated in Figure 7. It is not clear why the throughput appears to decrease as the
IOS approaches 1, but the decrease seems to be accompanied by a decrease in
the throughput variance.

5.2 Effects of Service Discipline 

We present here a comparison of Round-Robin vs. Grouped Round Robin. 



[Plot: effects of grouped round-robin on upstream delay; vertical axis: delay (msec)]

Figure 11: Latency improvements from grouped round-robin

In Figure 11, we plot two sets of simulated latency results from WWW client-server
sessions for a fixed number of modems. We see that by grouping lightly
loaded ports into pairs, the resulting latency of TCP transfers is significantly
improved. In the K=4 case, the upstream delay varied from 6 to 9 msec. For K=8,
the latency increased to the range of 10 to 13.5 msec, a substantial
increase.

This implies that the HFC system architect should identify lightly loaded ports 
and configure the RMUX to serve those ports concurrently. Doing so will greatly 
lower the average upstream latency of the entire system. This result also tells us 
that selecting the smallest K possible when configuring a RMUX will improve the 
system performance. 



6 CONCLUSIONS 

Focusing on the scheduling disciplines of a RMUX and how its performance 
depends on the load distribution of the upstream traffic, we have illustrated 
through simple analytical and simulation results that Return Path Multiplexing is a 
useful way of reducing headend port requirements (by a factor of K) in an HFC
plant without significant performance degradation. Simulation results show that a
properly configured RMUX can achieve greater than 80% throughput. The RMUX’s
multiplex service discipline only operates during the request phase on top of the 
contention resolution mechanism of the media access protocol (MAC), and does 
not affect the operation of the MAC during the data transfer phase. 

The simulation results presented in this paper used realistic simulated WWW 
traffic sources, and validated some of the simple analytical results. However, it is 
clear that a more detailed simulation study of the RMUX behavior is needed. 
Specifically, the effects of return path multiplexing on TCP level downstream
throughput and latency need to be studied.

We see that a simple cell level switch like the RMUX dramatically changes the 
architecture and provisioning policy of a cable modem system. Rather than having 
to accept a given physical layer topology and plant RF characteristic, the RMUX 
allows the cable operator to change a given topology without detrimental 
performance consequences typical of current passive RF combination methods. 
Whereas capacity planning and resource allocation of the HFC plant used to be 
issues constrained by the given RF condition of the return paths, now a device like 
the RMUX allows the cable operator to dynamically manage or isolate noisy return 
paths. 

However, we have also seen that it has an operating region that the HFC 
architect must be made aware of. Although the operating region identified in these 
results is large and can be easily engineered, an RMUX with an overly asymmetric 
load, as illustrated by the simulation results, will see performance degradation. In
order to properly configure the RMUX, an HFC system that allows the operator to
provision the bandwidth allocation of each modem is required. This highlights the 
need for the ability to provision and enforce QoS in a HFC system. 

There are many aspects of the RMUX’s operation which we have not touched on.
As with anything in the HFC plant, its proper performance depends primarily on 
the RF condition of the plant. Indeed, with favorable SNR in the HFC plant, it may 
be acceptable to service all the input ports of the RMUX concurrently during the 
request phase. This will allow the RMUX to operate without blocking, and simply 
let the contention resolution of the HFC protocol operate normally. Note that since 
the RMUX does not funnel noise during the data transport phase of the protocol’s 
operation, the resulting performance will be substantially better than using a 
simple passive RF combiner. The precise economic impact of a device like the 
RMUX is an important topic worthy of study, and investigations into other 
methods of achieving better RMUX performance are also ongoing. 

Disclaimer 

The results presented in this paper do not necessarily represent the performance or 
design of Com21’s commercial product that enables return path multiplexing.

Abstract

In order to provide guaranteed QoS, multiparty collaborative multimedia
applications require reliable transmission of data. These multimedia applications
range from distributed games and shared whiteboards to interactive video
conferencing. They often involve a large number of participants and
are interactive in nature, with participants dynamically joining and leaving the
applications [Sudan95]. When the number of participants is large, IP multicast is
a very good option for many-to-many communication. IP multicast provides
scalability and efficient routing but does not
provide the reliability these multimedia applications may require. Although a lot of
research has been done on reliable multicast transport protocols, it seems that
the only way of doing reliable multicast is to build it for a given purpose, such as
conference control in multimedia conferencing.

This paper compares some of the available multicast transport protocols and 
analyses the most suitable features and functionalities provided by these protocols 
for a facet of conference control, floor control. The goal is to find or design a 
reliable multicast transport protocol which would scale to tens or hundreds of 
participants scattered across the Internet and deliver the control messages reliably.

1 INTRODUCTION

Conferences come in many shapes and sizes, but there are two models of
conference control. These are known as formal (tightly coupled) conferencing and
informal (loosely coupled) conferencing. Lightweight, informal sessions are
multicast based multimedia conferences that lack explicit conference membership
control and explicit conference control mechanisms. Typically a loosely coupled
session consists of a number of many-to-many media streams carried by RTP
and RTCP over IP multicast. The only conference control information
provided during the course of a lightweight session is usually that distributed in the
RTCP session information, i.e. an approximate membership list with some
attributes per member.

In contrast, tightly coupled conferences, where the media streams flow
mainly on a one-to-many or one-to-one basis, require an explicit conference
control mechanism. In such a model a user interface is provided where the
chair can choose to give the floor to one of the participants, so that only one
person at a time can talk, take control of the shared whiteboard or use the video channel.

The most conventional tightly coupled conferences are based on the ITU H.323 [H.323] or
T.120 [T.120] standards, which were initially designed to work over
circuit switched networks like ISDN, while loosely coupled conferences are
Mbone [MboneFAQ] based and are designed for IP multicast. Some features of
tightly coupled conferences, like floor control, have only recently been designed
to work on IP with TCP over it, or to use UDP for other types of data. Therefore,
finding the most suitable reliable IP multicast for tightly coupled conferences is a recent issue.

IP Multicast provides a service model by which a group of senders and receivers
can exchange data without the senders needing to know who the receivers are*, or
the receivers needing to know in advance who the senders are. Hosts that have
joined a multicast group will receive packets sent to that group. This
service model can therefore lead to applications which scale to hundreds, thousands or
more receivers. However, because of limited bandwidth, most applications like
videoconferencing will deploy floor control to limit traffic from the group to a
small number of concurrent sources.

In order to support floor control either for a tightly coupled session (where 
reliability and ordering of the messages may get the highest priorities) or a loosely 
coupled session (where congestion control or retransmission strategy may be more 
complex and more critical than strict ordering), certain characteristics from a 
multicast protocol are required. The requirements for conference control from a 
transport protocol are: 

1. Reliability and loss detection

2. Retransmission strategy, queue management

3. Scalability - source to many receivers, many sources to many receivers etc

4. Ordering

5. Scope of membership

6. Congestion control

7. Integrated security

* Unless a higher level agreement has been made.

A lot of research is being done on reliable multicast transport protocols. This paper
looks at some of the available protocols, namely SRM, MTP/SO, RMTP, RLC and
PGM, and compares them against the requirements of a single facet of conference
control, floor control. These particular protocols were chosen because
they provide many of the functionalities required by a conference control mechanism.
However, other protocols that serve the same purposes may already be available or
in the design phase.

Section 2.0 is the background of some of the available reliable multicast protocols, 
section 3 analyses floor control and its requirements in general, the following 
section looks at the limitation of floor control, section 6.0 highlights the 
limitations of some of these multicast transport protocols in the light of floor 
control requirements. The last sections (7.0 and 8.0) describe an ideal reliable IP
multicast with certain characteristics and which of the available protocols provide 
some of these functionalities. 



2 BACKGROUND 

Loss detection and retransmission strategy are two important aspects in the design 
of any reliable protocol. In a reliable transport protocol a recipient can (within 
bounded time) find out when it is failing or being partitioned from active senders. 
A sender is assured (with sufficient probability) that all its messages reach
their recipients within bounded time.

In a traditional point-to-point reliable protocol such as TCP, positive 
acknowledgements are used to detect loss and the sender is responsible for 
retransmission of the packet. TCP carries HTTP Web traffic, FTP
file transfers, and e-mail. All TCP traffic is unicast, that is, it has one source and
one destination. The traffic can be either bulk data transfer, where all data
is sent one way and the sender then waits for a response, or interactive, where an
acknowledgement has to be returned as soon as each data unit is sent. The transmitter
sends out a window’s worth of data before requiring an acknowledgement.

It is harder to transfer data "reliably" from source(s) to R receivers (where R can be
tens to 100,000 or more), because multicast protocols interact with multiple parties
simultaneously and so involve a higher number of links. Therefore, the likelihood 
is greater that some of the paths in the source’s multicast tree are unstable at any 
time. In addition, the instability in any portion of the multicast tree may affect 
many members of the group because of the collaborative adaptive algorithms 
used[Floyd98]. In particular, it is difficult to build a generic reliable transport 
protocol for multicast, in the way that TCP is a generic transport protocol for unicast.
Reliable multicast is a case where "one size fits all" does not work.
Applications often have very different reliability and latency requirements, state
management styles, error recovery and group management mechanisms. A reliable 
multicast transport protocol that meets the worst-case requirements is unlikely to 
be efficient and scalable for many application requirements[Zhang97]. 

In a teleconferencing environment, a desirable robustness property is the ability to 
continue operating within partitions should the group become partitioned. 
Ultimately, the applications that use the multicast transport platform should be the 
ones to decide when the situation has deteriorated to a point where continuing is 
meaningless. 




Diagram 1: A basic diagram of a sender initiated protocol

Diagram 2: Receiver initiated protocol



The design of a reliable IP multicast protocol can be based on a tree structure, a
ring structure or an ACK/NACK (acknowledgement) structure.

In the following subsections we provide an overview of some of the reliable
multicast transport protocols:

2.1 SRM 

Scalable reliable multicast (SRM) has been embedded into an Internet 
collaborative whiteboard application called wb. In SRM, whenever a receiver 
detects a packet loss, it multicasts a NACK packet to the entire group. Upon 
receiving the NACK packet, any member holding the desired packet can multicast 
it to the group. To avoid duplicate NACK and repair packets, a suppression 
algorithm is used in which a node sets a random timer before multicasting a NACK 
or repair packet. The messages specify a time-stamp used by the receivers to 
estimate the delay from the source, and the highest sequence number generated by 
the node as a source. SRM's implementation requires that every node stores all 
packets or that the application layer stores all relevant data. 
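The effect of the random suppression timer can be illustrated with a deliberately simplified model. The assumptions here are not taken from SRM itself: every receiver is taken to have lost the same packet, timers are drawn uniformly from a fixed interval, and a single common propagation delay separates all members (real SRM timers depend on the estimated distance to the source).

```python
import random

def nacks_sent(num_receivers, max_wait, delay, rng):
    """Count NACKs multicast for one lost packet under random-timer suppression.

    Each receiver draws a timer uniformly from [0, max_wait]. A receiver sends
    its NACK if its timer fires no later than the moment the earliest NACK
    reaches it, i.e. `delay` seconds after the earliest timer fires.
    """
    timers = [rng.uniform(0, max_wait) for _ in range(num_receivers)]
    first = min(timers)
    return sum(1 for t in timers if t <= first + delay)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for wait in (0.01, 0.1, 1.0):
    sent = nacks_sent(num_receivers=1000, max_wait=wait, delay=0.01, rng=rng)
    print(f"max_wait={wait:>4}: {sent} of 1000 receivers sent a NACK")
```

Spreading the timers over an interval much longer than the propagation delay suppresses most duplicates, at the cost of a longer wait before the loss is reported.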

One of the problems with SRM is that this algorithm consumes a lot
of bandwidth when there is little correlation of losses among receivers. For
example, in a group of 1000 receivers, when only one receiver loses a packet, all
1000 receivers need to process the multicast NACK and repair packets. This
causes significant overhead. Also, if only one set of hosts requires a packet,
it is not desirable to multicast the packet to every member of the group. One possible
method of improving SRM's efficiency is to use localised recovery. The idea is to
multicast NACKs and repairs locally to a limited area instead of to the whole
group. Using the TTL (Time to Live) field in the IP packet header is one possible
way to implement scope control.

2.2 MTP/SO 

Multicast Transport Protocol (MTP) provides atomic and reliable transmission
of messages. MTP/SO provides global ordering where messages are assigned to
different streams. Therefore the delay caused by global ordering (for example
when a short message is preceded by a very long one) is eliminated. MTP/SO
proposes self-organisation of the members of a group into local regions to
address the NACK implosion problem. MTP/SO provides rate controlled
transmission of user data. There are three main types of members within a group:
co-ordinator, repeaters and normal members. To provide maximum throughput the
co-ordinator can send and receive retransmissions, whereas a
member that is only 'listen only' capable can send only
unreliable multicast datagrams to the group.

The rate controlled transmission of user data is very useful for floor control. If
only a few users are capable of holding the floor then there is little point in
giving all the other 10,000 receivers the capability of asking for retransmission of
floor requests. Although many of the functionalities of this protocol can be used for
conference control purposes (which is discussed in the section), the implementation
of MTP/SO is still at a very early stage.




Diagram 3: A basic diagram of a tree based protocol

2.3 RMTP (Globalcast Communication)

The Reliable Multicast Transport Protocol (RMTP) organises all the nodes into a tree
structure. The receiving nodes are always at the bottom of the tree. Ideally the
senders are at the top. The sender transmits messages using IP multicast. After a
message is transmitted, the sender will not release the memory until it receives a
positive acknowledgement from the group. The receivers do not send
acknowledgements directly to the top node (the sender), but send hierarchical
acknowledgements (HACKs). A receiver transmits a HACK to its parent in the
tree structure. The parent gathers the HACKs from all its children and sends a HACK
to its own parent node one step higher in the tree. The HACKs are propagated upward
to the top of the tree and the sender is eventually notified. This design allows
dissemination of messages to a large number of receivers without causing ACK
implosion.
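
The upward propagation of HACKs can be sketched as a recursive aggregation over
the tree. This is a simplified model, not RMTP's actual packet handling: the
Node class, the received flag and the hack function are names assumed here
for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    received: bool = True  # did this node itself get the message?


def hack(node: Node) -> bool:
    """Return the HACK this node sends to its parent.

    A node acknowledges positively only when it has the message and
    every child below it has acknowledged positively as well.
    """
    child_hacks = [hack(child) for child in node.children]  # gather from all children
    return node.received and all(child_hacks)


def can_release_buffer(sender: Node) -> bool:
    """The sender frees its copy only after a positive HACK for the whole tree."""
    return hack(sender)
```

However many receivers sit at the bottom of the tree, the sender only ever
processes one HACK per direct child, which is the point of the hierarchy.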

If there are many listeners and two or three speakers in a conference, then this is a
good architecture. The diagrams below represent a floor control scenario in an RMTP
type of architecture.




Diagram 4: One speaker and few listeners in a classroom
(labels: Speaker; Collector of questions)

Diagram 5: Tree based architecture (e.g. RMTP) represents the classroom type
of conferences

2.4 RLC and RMDP

In the Reliable Multicast data Distribution Protocol (RMDP), the problem of ensuring
reliable data delivery to large groups, and adaptability to heterogeneous clients, is
solved by a Forward Error Correction (FEC) technique based on erasure
codes [Vicisano98].

The basic principle behind the use of erasure codes is that the original source data,
in the form of a sequence of k packets, is transmitted by the sender along with
n - k additional redundant packets, and the redundant data can be used to recover lost
source data at the receivers. A receiver can reconstruct the original source data
once it receives a sufficient number of packets (any k out of the n). The main benefit of
this approach is that different receivers can recover from different lost packets
using the same redundant data. In principle, this idea can greatly reduce the
number of retransmissions, as a single retransmission of redundant data can
potentially benefit many receivers simultaneously.
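
The idea that any k of n packets suffice can be illustrated with the simplest
possible erasure code: a single XOR parity packet, so n = k + 1. This is a
deliberately reduced sketch and not the code used by RMDP, which allows far more
redundancy; the function names are assumptions for illustration.

```python
from functools import reduce
from typing import Dict, List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(source: List[bytes]) -> List[bytes]:
    """Turn k equal-length source packets into n = k + 1 packets (last one is parity)."""
    parity = reduce(xor_bytes, source)
    return list(source) + [parity]


def decode(received: Dict[int, bytes], k: int) -> List[bytes]:
    """Rebuild the k source packets from any k of the n = k + 1 packets.

    `received` maps packet index (0..k, where k is the parity packet) to payload.
    """
    if len(received) < k:
        raise ValueError("need at least k packets to reconstruct")
    missing = [i for i in range(k) if i not in received]
    if not missing:
        return [received[i] for i in range(k)]
    # Exactly one source packet is missing: XOR of everything else restores it.
    repaired = dict(received)
    repaired[missing[0]] = reduce(xor_bytes, received.values())
    return [repaired[i] for i in range(k)]
```

Two receivers that lose different source packets are both repaired by the same
parity packet, which is why one redundant transmission can serve many receivers.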

In order to deal with congestion control, the ultimate problem of one-to-many data
transfer protocols on top of IP multicast, RLC (receiver driven layered
congestion control) is proposed by the same authors. This mechanism is designed
for a transmitter sending data to many receivers on the Mbone [Levine98]. In
unicast communications, the sender takes part in congestion control by changing
its sending rate according to the congestion signals that it receives. In multicast
communications, this approach would be problematic, since different groups of
receivers with different requirements exist, and adapting to the needs of one set of
receivers would be unfair to the rest. In RLC, congestion control is therefore
performed by the receivers: each receiver can modulate its receive rate by
joining or leaving layers.
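
A receiver-driven scheme of this kind can be sketched as follows. The class, the
layer rates and the simple "drop a layer on loss, add a layer otherwise" rule are
assumptions for illustration; the published RLC mechanism has further details
that are omitted here.

```python
from typing import List


class LayeredReceiver:
    """Receiver that tunes its own rate by joining or leaving multicast layers."""

    def __init__(self, layer_rates_kbps: List[int]):
        self.layer_rates = list(layer_rates_kbps)
        self.subscribed = 1  # the base layer is always kept

    def receive_rate(self) -> int:
        """Total rate of all layers currently joined (layers are cumulative)."""
        return sum(self.layer_rates[: self.subscribed])

    def on_interval(self, loss_detected: bool) -> int:
        """Adapt once per measurement interval and return the new receive rate."""
        if loss_detected:
            # Congestion: leave the highest layer, but never the base layer.
            self.subscribed = max(1, self.subscribed - 1)
        elif self.subscribed < len(self.layer_rates):
            # No loss: probe for more bandwidth by joining the next layer.
            self.subscribed += 1
        return self.receive_rate()
```

The sender never changes its rate; a slow receiver simply stays on the lower
layers while a well-connected one climbs to the top.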

Though the above mechanisms are a very good solution for bulk data transfer, they do
not really satisfy the needs of floor control. For example, in a floor control
mechanism the identity of the participants is quite crucial. The combination of RLC
+ RMDP is therefore not really appropriate for floor control purposes.

2.5 PGM 

Pretty Good Multicast (PGM) is a reliable transport protocol for applications that
require ordered, duplicate-free multicast data delivery from multiple sources to
multiple receivers [Speakman98]. When a receiver detects a missing packet, it
repeatedly unicasts a NAK to the last-hop PGM network element on the
distribution tree from the source. It repeats this NAK until it receives a
NAK confirmation (NCF) multicast to the group from that PGM network element.
The network element in turn repeatedly forwards the NAK to the upstream PGM network
element on the reverse of the distribution path from the source of the original data
packet until it also receives an NCF from that network element. Finally, the source
itself receives and confirms the NAK by multicasting an NCF to the group.

PGM is not intended for use with applications that depend either upon
acknowledged delivery to a known group of recipients, or upon total ordering
amongst multiple sources. For floor control, these two functions are quite
crucial, and therefore PGM is not the best suited protocol for floor control. PGM is
better suited to applications in which members may join and leave at any time,
and that are either insensitive to unrecoverable data packet loss or are prepared to
resort to application recovery in that event.

2.6 Functional Criteria 

The table below is a comparison of several multicast transport protocols based on 
functions that are relevant for floor control. 
[...] Reliability Semantics [...] of regions, [...] HACK [...] size

RLC + RMDP | [...] No | No | No ACK/ [...] error [...] on file size
PGM | Reliable | No | Local retransmitters | No | Bread crumb | 1 packet
MTP/SO | Reliable, totally ordered, atomic delivery | Through different streams | Master, Repeater, Consumer | Known | NAK (?) | ?
NTE | Reliable | No | Distributed | Via Session packets | Triggered NAKs with randomisation + FEC | 1 ADU = 1 packet








3 FLOOR CONTROL AND ITS REQUIREMENTS 

Floor control in CSCW is a metaphor for "assigning the floor to a speaker", which
is applicable not only to voice channels, but more generally to any kind of sharable
resource within conferencing and collaboration environments [Dommel95]. A floor
is an individual temporary access or manipulation permission for a specific shared
resource, e.g. a telepointer or voice channel, allowing for concurrent and
conflict-free resource access by several conferees.

For example, a floor requester in a meeting room would be a person who raises
his/her hand to ask a question. It is up to the chair to grant the floor to the
requester. The session parameters include the number of collaborators and their roles
(chair, listener, floor holder), which determine their capabilities. The
interconnectivity (1-to-1, 1-to-many, many-to-many), the sharing distribution range (local,
wide area) and the link types (bi- or unidirectional) are important too.

There are several types of floor control policy available for use by collaborative
environments, such as explicit release, free floor and round-robin schemes [Greenburg
91]. Whatever the scheme is, for applications to scale beyond a few participants,
all communication must be multicast. Some research has been carried out to
support interactive collaboration applications, such as TMTP [Sudan95] for data,
STORM [Xu97] for audio and video and SRM [Floyd95] for wb. However, the
nature of floor control is somewhat different from these interactive applications. For
example, the volume of data, i.e. floor control messages, is much lower than that of audio,
video or whiteboard data; the timing of requesting and granting the floor
can be very specific (for example, when the chair/speaker addresses the audience
and asks for questions, many listeners will request the floor, whereas before
that the traffic may be much lower); and the ordering of data is a more crucial factor than
for audio/video (for fairness, or for applications such as customers bidding for shares).

Typically, traffic control for floor requests would be done at a low level, per source.
An example of a sudden flood of traffic is the "Flash Call" problem in POTS.
A flash call occurs when a televoting system is in operation, where the viewers
call a telephone number provided by a particular program to give their opinion.
The first method of avoiding this sort of problem is the non-deterministic approach,
where, after a certain number of calls have been taken by the network, users hear an
equipment engaged tone. This stops the network being flooded by too many
calls. The other approach is the deterministic approach, where the telephone company
is warned a day in advance by the program organisers, so that the telephone
company can provide enough resources for that sort of service, at a higher
cost.

On a data network, a similar situation can take place too. There are certain traffic
problems which apply only to floor control and conference control types of
application. A reliable IP multicast protocol has to include certain features which
account for:

Congestion control - The volume of traffic will increase at certain points in time.
The reliable IP multicast protocol has to cope with sudden bursts of traffic. Many sessions
have precise starting times, when most of the members of a conference join the
session, and multimedia tools such as vat and vic can be programmed to join a
session at the instant of its inception. This will cause a flood of traffic.

Ordering - To be fair to all the floor requesters, the IP multicast protocol has to have a
mechanism for strict ordering. Consider a receiver A, 120 ms away from the
server/chairman, who requests the floor. Receiver B, who happens to be 100 ms away
from the server/chairman, requests the floor 10 ms later. B's
request will reach the server/chair before A's request, which is unfair to A. In this
particular case the timing difference is so small that it may not really matter, but
the difference can be in seconds rather than milliseconds.

Reliability - To provide good service, reliability and the retransmission strategy are
quite important. Assume a scenario where a floor request is multicast by
receiver A, and receiver B has not received the message after time t. Receiver B now bids
for the floor, without knowing that receiver A has already requested it. Imagine there is
a policy in this conference that if someone has requested the floor, the next person is
not allowed to bid for the floor within the next t' seconds. In this
scenario, someone has to inform receiver B that receiver A has asked for the
floor, and that B may not request or be granted the floor. In protocols like TMTP the
domain manager retransmits the data, whereas in SRM the nearest receiver to B will
retransmit the data.

Member Classes - There can be different types of members in a conference. As
discussed in section 2.2, the rate controlled transmission of user data is very useful
for floor control. Where bandwidth is limited, this is a way to limit the number of
concurrent users on the network. For example, one type of member will be not
just a member but also a potential co-ordinator and repeater. Another type of
member will be just a normal member, and the last type of member will be an unreliable
receiver who will not ask for retransmission. If the members are categorised like
this, the job of the application programmer is made a lot easier. A model like
MTP/SO proposes to meet this requirement.
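
The ordering requirement above (receivers A and B) can be made concrete with a
small sketch. It compares the order in which requests arrive at the chair with
the order in which they were sent. It assumes that every request carries a send
time-stamp from a clock shared by all participants and that the chair waits long
enough to collect competing requests; both are assumptions for illustration, as
are the class and function names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class FloorRequest:
    sent_at_ms: int                          # only this field is used for ordering
    requester: str = field(compare=False)
    delay_ms: int = field(compare=False)     # one-way delay to the chair

    @property
    def arrives_at_ms(self) -> int:
        return self.sent_at_ms + self.delay_ms


def arrival_order(requests: List[FloorRequest]) -> List[str]:
    """Order seen by a chair that serves requests as they arrive."""
    return [r.requester for r in sorted(requests, key=lambda r: r.arrives_at_ms)]


def fair_order(requests: List[FloorRequest]) -> List[str]:
    """Order obtained when requests are sorted by their send time-stamp."""
    return [r.requester for r in sorted(requests)]
```

With A sending at 0 ms over a 120 ms path and B sending at 10 ms over a 100 ms
path, B arrives first (110 ms against 120 ms) although A asked first.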

4 LIMITATIONS OF FLOOR CONTROL 

Many of the multicast transport protocols, such as SRM, RMTP and MTP/SO, will meet
some of the requirements for floor control. Certain protocols can be customised or
adapted to meet some of the requirements. However, there are some limitations of
a floor control mechanism itself because of the nature of its behaviour. The
principal difficulty is in achieving scalability to large group sizes. In a conference
where all members have the ability to request (and grant) the floor, it is
necessary for all participants to know who the other participants are. Otherwise,
no one can see a global reason for giving someone the floor.

If the access bandwidth is small compared to the network backbone bandwidth, then at
time t there may be 1000 receivers in the system, but the RTCP reports
of the participants may show only the first 20 participants*. To account for congestion
control, a solution has been suggested in timer reconsideration for enhanced RTP
scalability [Rosenberg98]. Consider a multimedia session using RTP/RTCP for
transporting audio and video, where the RTCP rate is 1 kb/s. If all RTCP packets are
1 kb, packets should be sent at a total rate of one per second. Under steady state
conditions, if there are 100 group members, each member will send a packet once
every 100 seconds. However, if 100 group members all join the session at about
the same time, each thinks it is initially the only group member and sends
packets at a rate of 1 per second, causing a flood of 100 packets per second, or 100
kb/s, into the group.
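
The arithmetic of this example can be checked with a few lines of code. The
function names are assumptions for illustration; the figures are those used in
the paragraph above.

```python
def per_member_interval_s(rtcp_rate_kbps: float, packet_size_kb: float,
                          believed_group_size: int) -> float:
    """Seconds between reports for a member that believes the group has this size."""
    total_packets_per_s = rtcp_rate_kbps / packet_size_kb
    return believed_group_size / total_packets_per_s


def offered_load_kbps(actual_members: int, believed_group_size: int,
                      rtcp_rate_kbps: float = 1.0, packet_size_kb: float = 1.0) -> float:
    """Aggregate RTCP load when every member uses the same believed group size."""
    interval = per_member_interval_s(rtcp_rate_kbps, packet_size_kb, believed_group_size)
    return actual_members * packet_size_kb / interval
```

In steady state (100 members, each knowing there are 100) the load is 1 kb/s. If
each of the 100 members believes it is alone, the load is 100 kb/s.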

The effect of the timer reconsideration algorithm is to reduce the initial flood of
packets, which occurs when a number of users simultaneously join the group. A
participant P who wants to join at time t will determine the group size and
transmit at time t', where t' > t. So if a session has to start at 10:00 am, packets will
be sent at 10:01 am, 10:02 am and so on. Therefore the report showing
the number of participants at 10:00 am will not be correct.

So the underlying technology has to support users joining a session at t'', where t'' <
t. In other words, if the session is programmed to be broadcast at 10:00 am, users
have to join the session from 9:55 am. That requires modification of connection
charges to include the pre-session traffic flow.

If each participant sends messages at the rate of K/N per second, where K is the
share of the total capacity allowed for RTCP messages (in messages per second) and
N is the number of participants, the following can be derived:

For audio, we might choose to have 1 speaker, and therefore K is derived from the
capacity of that 1 flow. Typically RTCP messages might be limited to 5% of the flow, so for
20 packets per second we would be allowed 1 message per second. Over 5
minutes, this would allow N to reach 300.

For video, we may choose to allow either one video flow or several; to save
bandwidth, we would probably choose the current speaker's video channel. We might be
sending 100 packets per second from each and every source, which allows for K=5,
or N to reach 1500 participants after 5 minutes.
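
The two figures above follow from N = K × T, where T is the reporting window in
seconds. The sketch below reproduces them. It assumes, as the text implies, that
every participant reports once within a 5 minute window; the function names are
hypothetical.

```python
def rtcp_budget_msgs_per_s(flow_packets_per_s: float, rtcp_fraction: float = 0.05) -> float:
    """K: RTCP messages per second allowed as a fraction of the media flow."""
    return flow_packets_per_s * rtcp_fraction


def max_participants(k_msgs_per_s: float, window_s: float) -> int:
    """N: participants that can each report once within the window at rate K."""
    return int(round(k_msgs_per_s * window_s))
```

For audio, 20 packets per second gives K = 1 and N = 300 over 300 seconds. For
video, 100 packets per second gives K = 5 and N = 1500.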



* If the reliable protocol is distributed (e.g. in SRM/NTE), i.e. the participants can
only see the local information straight away and overall statistics are an option, then
this problem can be eliminated to an extent.

5 OBSERVATIONS

Many protocols are proposed and implemented. Protocols differ widely in design:

• Logical structure of communication pathways (ring versus tree versus none)
• Group membership mechanisms and assumptions
• Receiver-reliable versus sender-reliable
• ACK/NAK and FEC

Based on the floor control requirements for a reliable IP multicast (as discussed in
section 3.0), SRM will be one of the most suitable transport protocols if all the
participants are multicast capable, because SRM represents a simple and robust
approach to large-scale recovery based on persistent state, suppression of
duplicate NACKs and repairs, and global retransmissions. The messages carry a
time-stamp used by the receivers to estimate the delay from the source, which
makes global ordering possible. Also, the model of this algorithm is distributed, so that the
participants list will not take too long to update. However, if the number of
participants is very large, the convergence time will grow exponentially and SRM
will not be the best suited algorithm.

If some of the participants in a video conference are unicast only, a tree based
structure for IP multicast like RMTP or MTP/SO will be quite suitable too. In the
hierarchical system, one parent node can have several unicast-only child nodes
underneath it and can unicast the data to these child nodes. In this model the
participants list can be viewed by the parent node, as shown in diagram 5.

6 LIMITATIONS OF SRM AND MTP/SO 

SRM is very efficient at retransmitting lost packets, whereas MTP is customised
to take care of different classes of members in a conference. Neither of these
protocols caters for the congestion or flood of packets which will be caused by, for
example, the start of a session or question time in a conference. This sort of
problem is solved by RMDP or by the approach taken by RTP timer reconsideration.



7 IDEAL PROTOCOL FOR CONFERENCE CONTROL 

After discussing the pros and cons of the different protocols, it seems that a reliable
multicast protocol has to be able to provide:

Congestion control: Cope with sudden bursts of traffic. If the number of receivers is
small (for example, up to 100 receivers), a buffer can be provided to store the
requests. Otherwise, a mechanism has to be provided where pre-session traffic
flow is allowed. RTP timer reconsideration is an example of how to deal with congestion
control. Also, if a user who has just held the floor waits a certain amount of time before
asking for the floor again, this will help to reduce the implosion as well.

Ordering: The point about floor control is that requesters should get a fair chance
of getting the floor. The problem with the reliable multicast transport protocols is
that, to scale, they use techniques like SRM's random timers. What is required is
deterministic (round robin) timers for people requesting the floor at the same time.
None of the transport protocols includes this feature. So if a participant asked for
the floor or got the floor last time, then they have to go after everyone else, i.e.
that participant has to wait before asking for the floor again.

Reliability: The fastest way to retransmit lost or damaged packets. Not just the source
but anyone holding the packet should be able to transmit it to the receiver that requires
that packet. SRM’s retransmission strategy provides that.

Distributed control: As discussed in section 4.0, because convergence time
increases as the number of users increases, there is a limit on the size of a conference
of known participants. A hierarchical system, in which each member knows only a certain
group or certain local users, is a possible solution. RMTP or STORM can
provide that sort of architecture.

Simple: The status of the floor holders is multicast, and a request is multicast to the group
too. Any IP multicast can provide this function.

Other: May be able to cope with unicast-only receivers too. Security is provided
for alternative approaches.
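
The deterministic (round robin) rule described under Ordering above can be
sketched as a priority list in which whoever last held the floor moves to the
back. The class and method names are hypothetical, and this is one possible
reading of the requirement rather than a feature of any of the protocols
surveyed.

```python
from collections import deque
from typing import Iterable, Optional


class RoundRobinFloor:
    """Grant the floor deterministically; recent holders go after everyone else."""

    def __init__(self, participants: Iterable[str]):
        self._order = deque(participants)  # front of the deque = highest priority

    def grant(self, requesters: Iterable[str]) -> Optional[str]:
        """Pick the highest-priority requester and move that participant to the back."""
        wanted = set(requesters)
        winner = next((p for p in self._order if p in wanted), None)
        if winner is not None:
            self._order.remove(winner)
            self._order.append(winner)
        return winner
```

When A and B ask at the same time twice in a row, A is served first and B
second, regardless of whose request happened to reach the chair first.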



8 CONCLUSION 

There are protocols like RMTP/STORM, NTE and SRM which are designed for
specific applications. SRM is a robust protocol which meets many of the
requirements for conference control. MTP and RMTP meet certain criteria too.
However, these protocols need a level of customisation or adaptation to
be an ideal protocol for conference control. This paper has also looked at the limitations of
these protocols and the limitations of floor control in achieving scalability.
If a reliable multicast protocol has to be designed to meet the requirements of
floor control, it can be quite complicated to cater for ordering, congestion control,
pre-session traffic flow etc. In order to keep it simple, we need a mechanism where the
status of the floor holders is multicast to the group every few seconds. If a user
wishes to bid for the floor, the request is multicast too. The stabilising or
converging time normally grows as the number of participants grows, so a
hierarchical system will be a better solution. It is also necessary to provide a
distributed model for retransmission and to keep the status of receivers up to date.

Abstract

The current work argues that from a security perspective there is much to be
gained by employing a “secured” IP multicast at the Network layer to support the
formation and management of secure conferences at the Application layer. A
secured IP multicast - with group authentication and confidentiality - already
achieves a reasonable level of security, and therefore fulfils a large part of the basic
requirements of secure conferencing. If host-to-host authentication and
confidentiality have been achieved through an N-to-N multicast that has been
secured, then to a large extent the basic security needs of conferencing have been
satisfied. What remains would be for the other conference-specific security
requirements to be satisfied using methods which are particular to a given
conference scheme, such as cheater detection/identification methods based on
cryptographic techniques. In the current work we propose an architecture called the
Multicast/Conference Security Architecture (MCSA) to facilitate the use of (a
secured) IP multicast at the Network layer for establishing (a secured) conference
at the Application layer.



Keywords 

Secure multicast, secure conferences, cryptography, routing protocols, key 
management 

1 INTRODUCTION 

The issue of group-oriented security has been a topic of interest in the field of
computer and network security for the last two decades. This has been facilitated
by a number of factors, including the growth of the Internet, the development of
cryptosystems (in particular, public key cryptosystems), and the development of
hardware and software for desktop level computing. Increasingly users are using
the Internet not only for exchanging messages, but also for more complex
interactions and as a forum for decision-making.

In this paper we attempt to bring together the main areas of research and 
development related to group-oriented security. The first area consists of solutions 
and schemes which are known collectively under group-oriented cryptography. 
These cover, among others, conference key distribution systems (eg. Ingemarsson, 
Tang & Wong 1982, Koyama & Ohta 1987, Steiner, Tsudik & Waidner 1996), 
digital multisignature schemes and secret sharing schemes (Simmons 1992). Most 
of the conference key distribution schemes that have been proposed require 
extensive cryptographic operations and have been designed with the Application 
layer in mind. These employ user authentication techniques either separately or 
integrated into the conference key distribution scheme. Some rely on the use of 
smartcards (eg. Koyama et al. 1987) as a user authentication technique and as a 
medium to store the conference security parameters. 

The second area of development closely related to group-oriented communications
is multicast, more specifically IP multicast. Here, group security is to be achieved
through the distribution and management of cryptographic keys at the Network
layer, using approaches which we broadly term group key management (GKM)
protocols (Mittra 1997, Harney & Muckenhirn 1997, Ballardie 1996, Harkins &
Doraswamy 1997).

In this paper we argue that from a security perspective there is much to be gained
by employing a “secured” IP multicast at the Network layer to support the
formation and management of secure conferences at the Application layer. A
secured IP multicast - with group authentication and confidentiality - already
achieves a reasonable level of security, and therefore fulfils a large part of the basic
requirements of secure conferencing (see Section 2.1). If host-to-host
authentication and confidentiality have been achieved through an N-to-N multicast
that has been secured, then to a large extent the basic security needs of
conferencing have been satisfied. What remains would be for the other
conference-specific security requirements to be satisfied using methods which are particular to
a given conference scheme, such as cheater detection/identification methods based
on cryptographic techniques. Other benefits that would emerge include the
increased efficiency of the conference formation and the possibility of
simplifying the conference key distribution schemes being employed.

In the current work we propose an architecture to facilitate the use of (a secured) IP
multicast at the Network layer for establishing (a secured) conference at the
Application layer. The purpose of the architecture is to introduce components at
the Session layer that will coordinate the creation of multicast sessions, the
obtaining of group-keys for the groups, the “mapping” of the conference
instance to the multicast instance, and other tasks. The architecture must also
allow different secure conference key distribution schemes to be used, independent
of the multicast protocol and group key management protocol at the Network layer.

In the remainder of the paper, we define a “conference” as the N-to-N 
communications occurring at the Application layer (or originated from events at 
the Application layer). Similarly, we define “multicast” or “IP multicast” as the N- 
to-N communications occurring at the Network layer. We use the term multicast 
“session” to include all multicast “groups” related to a session. Thus, for example, 
a session may have an audio group and a video group (as in the Mbone), and both 
would be considered together when dealing with security. Consistent with this 
convention, we thus distinguish between a “conference key” for a conference and a 
“group key” for a multicast instance. For simplicity of the discussions, in either of 
the two cases we assume that the key is a private key (ie. symmetric 
cryptosystems) and we assume that the key is used to encipher traffic among the 
parties involved. Other variations on the type of key can certainly be employed 
depending on the needs of the circumstances. 

In Section 2 some background is provided, looking from the perspectives of both 
the Application layer and the Network layer. This is followed by the proposed 
architecture in Section 3. Section 4 closes the paper with some remarks and 
conclusions. 



2 PERSPECTIVES ON CONFERENCES AND MULTICAST 

Research and development in the area of group-oriented security in the past two
decades have largely focused on two major directions, which can be viewed from the
Application-layer perspective and from the Network-layer perspective, reflecting
the two main locations in the communications software architecture where
solutions have been suggested.



2.1 The Application Layer Perspective 

Much of the research effort carried out on the Application layer has revolved around
the use of cryptographic techniques to achieve a fair and secure method of
distributing cryptographic keys to participants of a conference. The key belonging to
the conference is then used to secure (eg. encrypt/decrypt and/or authenticate)
the messages exchanged between the conference participants. To these schemes we
apply the broad term Conference Key Generation and Distribution System
(CKGDS), which may or may not incorporate user-authentication. The conference
membership is usually determined by the piece of secret information carried by (or
assigned to) the users. Joining a conference can be signified by the user indirectly
applying a portion of his/her secret information towards the conference-key
computation. (For example, the user may apply his/her secret key to a circulating
irreversible “token” that accumulates the keys). CKGDSs are often coordinated by
a Conference Coordinator (eg. the participant that initiated the conference or a
trusted third party). Other approaches employ a trusted third party to fully generate
and deliver the conference key.

There are a number of security requirements for secure conferences. Some 
requirements that are of interest to the current work include: 

• Source identity and source authentication : a member of the conference must 
be able to verify the source (identity of the sender) and authenticate the 
message from the sender. This assumes that user authentication has been 
enforced. 

• Data confidentiality : encryption of data for confidentiality must be available 
for all traffic should the conference members decide to use such means. 

• Participation non-repudiation : depending on the nature of the conference, 
members of the conference must not be able to repudiate their presence in a 
particular conference. This is crucial when the conference arrives at a decision 
which is binding to all members who are present, and who should not, 
therefore, be able to deny this fact at some future time. 

• Sender/receiver non-repudiation : both senders and receivers in a conference 
must not be able to repudiate the fact that they sent or received messages 
respectively. 

• Cheater detection and identification : any members (or intruders) that cheat
must be able to be detected and identified. There is a range of behaviours that
can be considered as cheating. A typical example would be a member that
wishes to participate in decision-making in a conference, but who would like to
avoid the commitments resolved in that conference, by way of not submitting
the correct information (eg. security parameters) at the creation of the
conference. Other examples include masquerading as other members of the
conference.

• Joining conferences securely : new parties that are permitted to join a 
conference (eg. by policy) should be able to do so in a manner that does not 
compromise or reduce the security of the existing conference participants. 

• Secure ejection of a member: when a conference wishes to eject a member, a
secure method must be used, both to arrive at the consensus with regard to
the ejection decision and to actually carry out the ejection. The ejection of a
member should not in any way affect the security of the remaining members.

Although outside the scope of the current work, another issue that is commonly
related to conferencing is the anonymity of the participants of the conference.
Should it be required, anonymity can be achieved at the conference level using the
appropriate secure conferencing scheme where the identity of the actual user is
hidden using a pseudonym (Chaum 1981), and where the identity and the
pseudonym are linked through other pre-conference means (such as smartcards, for
example, in the scheme of Koyama et al. (1987)). Although in most circumstances
the end-users of the conference service are humans who require the identity of the
peers to be known, there are circumstances where pseudonyms can be used and be
made legally binding should the conference require it.



2.2 The Network Layer Perspective 

In contrast to the efforts based on extensive cryptographic operations to satisfy the 
security requirements of conferencing, the developments that strive to find a 
solution at the Network layer have originated from the need and desire to make IP- 
multicast itself secure, independent of the applications that may employ it. These 
have taken the form of group key management (GKM) protocols, and have largely 
focused on providing practical and implementable solutions based on the 
realization that the network has true physical limitations and that the user-response 
quality represents a major factor in the design. The GKM approach is driven by the 
realization that although some security mechanisms have been introduced into 
applications that make use of IP-multicast, in itself IP-multicast does not have any 
authentication and/or confidentiality features. 

A number of GKM protocols have been proposed (e.g. Mittra (1997), Harney et al.
(1997), Ballardie (1996), Harkins et al. (1997)). Here, the idea is that the GKM
protocol (running at each relevant host) distributes keys to all hosts of a multicast
session. Each group is assigned a unique key, and all traffic within the group
is then encrypted using the group key. Because every member holds the same key,
only group authentication is afforded, not sender authentication.
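
To see why a shared group key gives group authentication but not sender authentication, consider the following minimal Python sketch. It uses an HMAC tag as a stand-in for whatever integrity mechanism a protected group might apply; the names GROUP_KEY, tag and verify are illustrative assumptions and do not come from any of the cited GKM protocols.

```python
import hashlib
import hmac

# Hypothetical group key, as handed out to every member by a GKM protocol.
GROUP_KEY = b"example-group-key"


def tag(group_key: bytes, claimed_sender: str, payload: bytes) -> bytes:
    """Compute a MAC over the claimed sender name and the payload."""
    message = claimed_sender.encode() + b"|" + payload
    return hmac.new(group_key, message, hashlib.sha256).digest()


def verify(group_key: bytes, claimed_sender: str, payload: bytes, mac: bytes) -> bool:
    """Accept the packet if the MAC matches under the shared group key."""
    return hmac.compare_digest(tag(group_key, claimed_sender, payload), mac)


# Host B holds the group key, so it can build a packet that claims to be from A.
forged = tag(GROUP_KEY, "host-A", b"hello")
accepted_forgery = verify(GROUP_KEY, "host-A", b"hello", forged)

# A non-member without the key cannot produce an acceptable tag.
outsider = tag(b"wrong-key", "host-A", b"hello")
accepted_outsider = verify(GROUP_KEY, "host-A", b"hello", outsider)
```

The receiver learns only that the packet came from some holder of the group key; the claimed sender name carries no weight on its own.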

One of the fundamental aims of these GKM protocols, and one that we also
subscribe to, is to separate the group key management for multicast from the
multicast service itself, thereby making the GKM independent of the
multicast protocol that happens to be in use (e.g. CBT of Ballardie, Francis &
Crowcroft (1993), MOSPF of Moy (1994), and others). Although there have been a
number of proposals for a group key management protocol, none is currently
in use on a wide scale on the Internet.

Other developments towards securing communications at the Network layer have 
emerged in the form of IPSEC (Atkinson 1995) and its related security 
technologies such as the Internet Security Association and Key Management 
Protocol (ISAKMP) (Maughan & Schertler 1997) and the Internet Key Exchange 
(IKE) (Harkins & Carrel 1998). IPSEC and IKE are unicast technologies that aim 
to secure communications between a sender and a receiver at the Network layer. 
This is achieved through the creation of security associations (SAs), each linked
to a Security Parameters Index (SPI), which identify all the parameters (i.e.
encryption algorithm, algorithm mode, encryption keys, etc.) required for the two
parties to communicate securely. However, IPSEC is designed for unicast between
one sender and one receiver, and therefore is not fully satisfactory for the security
needs of multicast. In particular, the current version of IPSEC does not provide for 
the creation of a single security association for an entire multicast group. 
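
The pairwise nature of security associations can be illustrated with a small Python sketch. It is a simplified model only: the class SecurityAssociation, its field names, the table keyed by a (destination, SPI) pair, and the example addresses and keys are all assumptions made for illustration, not the IPSEC data structures themselves.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SecurityAssociation:
    """Parameters two unicast peers agree on (field names are illustrative)."""
    algorithm: str
    mode: str
    key: bytes


# Illustrative SA table: a (destination address, SPI) pair selects one SA.
SaTable = Dict[Tuple[str, int], SecurityAssociation]


def lookup(table: SaTable, destination: str, spi: int) -> SecurityAssociation:
    """Return the SA for a packet, or raise KeyError if none exists."""
    return table[(destination, spi)]


# One entry per peer: a sender talking to two receivers needs two SAs.
table: SaTable = {
    ("192.0.2.10", 0x1001): SecurityAssociation("3DES", "CBC", b"key-for-host-10"),
    ("192.0.2.20", 0x1002): SecurityAssociation("3DES", "CBC", b"key-for-host-20"),
}
sa = lookup(table, "192.0.2.10", 0x1001)
```

In this model the number of entries grows with the number of receivers, which is the limitation for multicast noted above.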



2.3 Classification of Multicasts and Conferences 

In this section we broadly classify groups from the perspective of the Network
layer and the Application layer, recognizing that the various attributes of
groups have different interpretations in different circumstances. It follows that
no single solution can meet the security needs of both multicast and
conferencing. Hence our approach is an architecture that can support the various
parameters of multicasts and conferences.

2.3.1 Network Layer Perspective

Receiver view: 

Open multicast 

• Any host is free to join a group and to receive the group's traffic. 

• Joining does not require explicit permission. 

• The receiver host asks its local subnet router to add it to a given group (e.g.
using IGMP or a similar group-membership protocol).

Closed multicast 

• Explicit permission to join a group must be requested (e.g. from the initiator of
the group).

• Mechanisms must be employed to enforce permissions (for example, 
encryption of traffic within the group). 

Sender view: 

Open multicast 

• Any host can initiate/create a new group (footnote 1).

• Any host can send to a group without being a member of the group (footnote 2).

• Any host can send to any group. 



1 Although a host can create a group, technically speaking the host does not have to be a sender or
receiver in the group. For simplicity, we assume that a host that creates a group is at least a sender.

2 This may or may not be desirable, as it may allow denial-of-service attacks on the group. However,
the notion of an authorized non-member sending to a group may have its uses and benefits.

Closed multicast

• Not everyone can create a group, and it is desirable to have mechanisms to 
limit who can create/initiate new groups. 

• Only members of a group can send to the group. 

• Traffic may be enciphered as a way to enforce explicit permissions. 



2.3.2 Application Layer Perspective 

Receiver (Sender) view: 

Open conference 

• Any user can initiate a conference. 

• Any user can join or receive (send) provided he or she reveals his/her 
verifiable identity to the conference Coordinator and other conference 
participants. 

• Presence at the conference does not bind a participant to the outcomes of the 
conference. 

• The user need only submit parameters that are sufficient for identification 
purposes. 

Closed conference 

• Any user can initiate a conference. 

• By definition, only limited individuals can receive from (send to) the 
conference. 

• An invitee must make an explicit request to the conference Coordinator.

• The identity of an invitee must be verified before he or she can receive (send). 

• The invitee must submit cryptographic parameters as part of the conference 
key generation and distribution scheme. 

• Participation in the conference must be provable through the key generation 
and distribution scheme. 

• A participant must not be able to repudiate his/her presence and participation 
in the conference. 

• The conference outcome is legally binding. 



3 MULTICAST SUPPORT FOR CONFERENCING 

In the current work we propose an architecture to facilitate the use of (a secured) IP 
multicast at the Network layer for establishing (a secured) conference at the 
Application layer. The purpose of the architecture is to introduce components at
the Session layer that will coordinate, among other things, the creation of multicast
groups, the obtaining of group keys for the groups, and the “mapping” of the
conference instance to the multicast instance. The aim is also to provide protocol
independence, in the sense that different secure conferencing schemes at the 
Application layer can be used independent of the multicast protocol and group key 
management protocol at the Network layer. 

One important assumption in the proposed architecture is that the underlying 
multicast service model (and its associated routing structures) allows a host who is 
part of a group to also transmit messages to the group. This can be achieved by 
using a multicast protocol that allows a receiver-host to also transmit messages to 
the group. Thus, in effect the underlying multicast protocol performs the N-to-N 
multicast. Most IP multicast protocols today allow this to occur. 

If the multicast protocol is limited to providing unidirectional
transmission, then an overlay of N instances of these 1-to-N multicasts must be
created and managed. That is, N separate multicast “trees” (each emanating from a
source to the receivers) must be created. This overlay results in the creation of a
total of N^2 security associations. Due to its large resource consumption, we will
not consider this second approach any further.
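
The difference in resource consumption can be made concrete with a short Python calculation. The overlay count follows the N^2 figure stated above; the count for the shared N-to-N group is an assumption made here for comparison (one security association per member-host with a key distributor), and both function names are illustrative.

```python
def sas_for_overlay(n: int) -> int:
    """N unidirectional 1-to-N trees, N associations each (the N^2 figure)."""
    return n * n


def sas_for_shared_group(n: int) -> int:
    """Assumed model: each of the N member-hosts keeps one SA with a key distributor."""
    return n


# Print members, shared-group count and overlay count side by side.
for members in (5, 20, 100):
    print(members, sas_for_shared_group(members), sas_for_overlay(members))
```

Under these assumptions the overlay cost grows quadratically while the shared group grows linearly with the number of participants.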



3.1 Multicast Conferencing Security Architecture 

The current work approaches the issue of providing support for conferences using
(a secured) multicast by introducing the Multicast/Conferencing Security
Architecture (MCSA), which provides selectable components that reside at the
Session layer. The MCSA components together act as an intermediary between the
conference application program at the Application layer and the multicast-related
protocols (host-side or client-side) at the Network layer, such as the Group
Management Protocol (GMP) and the Group-Key Management (GKM) protocol
(Figure 1). Here we use the general term GMP to denote the group membership
and management protocol that is being employed (for example, IGMP (Deering
1989)). The IPSEC and IKE protocols (Atkinson 1995, Harkins & Carrel 1998) are
also used, directly or through the GKM protocol, for host-to-host key exchange
and data confidentiality.

Figure 1: The Multicast/Conference Security Architecture

In practice the GKM protocol typically has a central point which authorizes and 
distributes keys for the group. Here we will broadly refer to that entity as the 
Authorizer, which may generally correspond to the Group Key Controller in the 
work of Harney et al. (1997), to the primary core in the CBT-based solution of 
Ballardie (1996), or to the Key Distributor (KD) in the scheme of Harkins et al. 
(1997). The Authorizer entity can be a router, a server or another device.

In order to participate in the MCSA, the Authorizer entity must contain the 
components of the MCSA so that peer-level interaction would be possible. The 
Authorizer is assumed to run the GKM protocol to establish group-keys and 
perform high-level tasks, including certificate management, user access control, 
policy implementation, and others. The notion of an Authorizer is beneficial also 
from the point of view of security and access control policies, since the Authorizer 
can embody these policies. The Authorizer within a subnet can in fact be a subset 
of the policy hierarchy governing the entire autonomous system. 

From the Network layer’s perspective the Authorizer generates group-keys which 
are used by the group members to encrypt the payload within the multicast packets 
of a given group. Each group is assumed to have a secret (symmetric) key which is
obtained by each member-host through a security association (SA) with the
Authorizer. This further implies that each member-host must have a
distinguished name (DN) (Adams and Farrell 1998) and must have a certificate 
before any security association can be established (Atkinson 1995). Hence, in our 
architecture we assume that each host has already been assigned a distinguished 
name and a certificate by the certification authority (CA). 



3.3 Components of the Architecture 

The MCSA consists of a number of components (Figure 2) which orchestrate the 
interaction between the multicast and the conference events. All parties involved in
the conference and multicast are assumed to employ the MCSA. Furthermore, we
assume that the Authorizer contains all the peer elements at the Application layer 
and at the Network layer. The arrows in the diagram are simplified to denote 
interaction, both in terms of control/invocation and data/control flow. 

The Conference Session Manager (CSM) and the Multicast Session Manager 
(MSM) are the two components that look after the relevant events occurring at the 
Application layer level and at the Network layer respectively (Figure 2). The CSM 
works in conjunction with the conference application and the conference key 
generation and distribution system (CKGDS). The CSM maintains the state 
information for each conference of which the user is a participant, and it maintains 
a database of the corresponding conference keys. Although the CSM does not 
actually use the conference keys, it has access to the database in order to maintain 
a correspondence between the instance of the conference (since the user may be
involved in several simultaneously) and the key for that conference.

The Multicast Session Manager (MSM) cooperates with the Conference Session 
Manager (CSM) in maintaining a correspondence between a conference instance 
and its underlying multicast instance. Depending on whether a host is the initiator 
of the multicast group or a receiver/sender in the group, the MSM has a number of 
tasks related to the creation of the multicast and the securing of it. The MSM is 
responsible for the initiation of the multicast group as a response to requests from 
the conference application. Through the session directory (SD) it coordinates the 
announcement of the new multicast group. It communicates with the peer MSM at 
the Authorizer in order to notify the Authorizer of the new group and request a new 
group key.

Figure 2: Components of the MCSA
(SAD = Security Association Database; SD = Session Directory)

A user that wishes to join a conference must instruct its host (implementing the 
MCSA) to first join the multicast group corresponding to that conference. Joining 
the multicast group can be achieved through the GMP. After the host becomes a 
group member, it must open a security association with the Authorizer in order to
obtain the corresponding group key. The MSM must maintain a mapping between 
the multicast instances (of which the host is a member) and the security association 
contained in the Security Association Database (SAD), the SAD being part of the 
IPSEC definition. Through the security association, a copy of the group key is 
obtained from the Authorizer and it is stored in the group key database. This group 
key is used to encipher (decipher) traffic at the Network layer destined for 
(received from) the multicast group. 
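
The bookkeeping described above can be sketched as three small lookup tables. This is a toy Python model under stated assumptions: the class name, the method names, the use of an integer to stand for a SAD entry, and the example identifiers are all hypothetical and are not defined by the architecture.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MulticastSessionManager:
    """Toy model of the MSM mappings; all names are illustrative."""
    conference_to_group: Dict[str, str] = field(default_factory=dict)
    group_to_sa: Dict[str, int] = field(default_factory=dict)    # group -> SAD entry id
    group_keys: Dict[str, bytes] = field(default_factory=dict)   # group key database

    def record_join(self, conference_id: str, group_addr: str,
                    sa_id: int, group_key: bytes) -> None:
        """Store the mappings once the key has arrived over the SA."""
        self.conference_to_group[conference_id] = group_addr
        self.group_to_sa[group_addr] = sa_id
        self.group_keys[group_addr] = group_key

    def key_for_conference(self, conference_id: str) -> bytes:
        """Find the group key protecting the multicast behind a conference."""
        return self.group_keys[self.conference_to_group[conference_id]]


msm = MulticastSessionManager()
msm.record_join("conf-42", "224.2.0.1", sa_id=0x2001, group_key=b"group-key-bytes")
```

Keeping the conference-to-group and group-to-key tables separate mirrors the intended independence between the conference scheme and the multicast key management.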

Although it is not clearly shown in Figure 2, ideally the conference application and the
corresponding CKGDS should deal with the MCSA through a suitable interface or
API. Such an API should be usable with a variety of CKGDSs and may even be
extendable to other applications that exhibit conference-like behaviours and
security requirements.



3.4 Initiator and Sender/Receiver Interactions 

In the current architecture, we assume that at the conference level, the 
user/application initiating the conference is the Coordinator of the conference and 
the contact-point for other users wishing to join or leave the conference. The issue 
of the ejection of conference members is also the responsibility of the Coordinator. 

In the following we consider the MCSA as a unit from the perspective of a 
sender/receiver and an initiator of a conference. We assume that the initiator host is 
also where the conference Coordinator resides. 

Initiator MCSA: 

• Upon a conference-formation request from the application/user, the MCSA 
invokes the session directory (SD/SDP) to announce a new multicast session. 
The announcement must contain the identity of the Authorizer from which 
other hosts can obtain the group key. The announcement must also indicate 
that the multicast session will be part of a conference soon to be created. The 
application/user may also provide a list of acceptable (non-acceptable) users 
for the conference and the associated access control information. 

• The MCSA creates a security association with the Authorizer and notifies the 
Authorizer about the new multicast session. The MCSA also provides it with 
access control information (for access at the user level and the host level). The 
Authorizer prepares a group key for the multicast session. 

• The MCSA uses the security association to obtain a copy of the group key. 

• The MCSA invokes IPSEC to encipher all subsequent traffic destined for the 
multicast session. 

• After a given period of time (to allow other hosts to become members of the 
multicast session) the MCSA initiates the transmission of a conference-call 
message through the multicast session, with the announcement details being 
encrypted using the group-key. It names its own CKGDS as the Coordinator 
of the conference. 

• The MCSA waits for the conference-call-responses from the members of the 
multicast session, and notifies its conference application and CKGDS about 
the group-members that request to join the conference. 

• The MCSA notifies the CKGDS that all peer MCSAs are waiting for the
conference key generation and distribution to commence. The Coordinator
CKGDS (at the Initiator host) then begins the conference key generation and
key distribution phase, culminating in each application having a copy of the
conference key. Note that all traffic during (and after) the conference key
generation/distribution is encrypted at the Network layer using the multicast
group key.



Receiver/Sender MCSA: 

• Upon user request the MCSA invokes the session directory tool to determine
the active multicast sessions (i.e. their start time, duration, IP addresses, formats, etc.).
Included here is the address of the Authorizer. As an aside, the user may also
request the Authorizer to provide it with a copy of the Authorizer's certificate
(signed by an acceptable global authority) to prevent masquerading.

• The MCSA notifies the user/application about the active multicast sessions 
and asks the user to select the multicast sessions to join. 

• The MCSA invokes the GMP to notify the local subnet router about the 
request to join a particular multicast session(s) which comprise a conference. 
(This step depends to a large extent on the multicast protocol being 
employed). 

• Upon its host becoming a member of the multicast session, the MCSA invokes 
the GKM to obtain the group key. One way to obtain the key is to create a 
security association (SA) with the Authorizer, and then to use the resulting 
secured channel to download a copy of the key. 

• The MCSA invokes IPSEC to decipher (encipher) all subsequent traffic 
received from (destined for) the multicast session using the group key. 

• Upon seeing a conference-call message (issued through the peer MCSA at the 
Coordinator host) the MCSA responds to the call and notifies its own CKGDS 
of the ready-status of the Coordinator CKGDS. After some conference 
parameter negotiations, the CKGDS participates in the conference key 
generation and distribution scheme. 
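
The ordering constraints in the Receiver/Sender steps can be captured as a small state machine. The Python sketch below is one possible reading of the list above; the state names, event names and the functions step and run are assumptions introduced for illustration only.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SESSION_SELECTED = auto()
    GROUP_MEMBER = auto()
    HAS_GROUP_KEY = auto()
    IPSEC_ACTIVE = auto()
    CALL_ANSWERED = auto()


# Each event is legal in exactly one state (names are illustrative).
TRANSITIONS = {
    ("select_session", State.IDLE): State.SESSION_SELECTED,
    ("gmp_join", State.SESSION_SELECTED): State.GROUP_MEMBER,
    ("fetch_group_key", State.GROUP_MEMBER): State.HAS_GROUP_KEY,
    ("enable_ipsec", State.HAS_GROUP_KEY): State.IPSEC_ACTIVE,
    ("answer_conference_call", State.IPSEC_ACTIVE): State.CALL_ANSWERED,
}


def step(state: State, event: str) -> State:
    """Advance the receiver; reject events that arrive out of order."""
    try:
        return TRANSITIONS[(event, state)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def run(events) -> State:
    """Feed a sequence of events to a receiver that starts in IDLE."""
    state = State.IDLE
    for event in events:
        state = step(state, event)
    return state
```

For example, a host that tries to fetch the group key before the GMP join has completed is rejected, which reflects the requirement that membership precedes key retrieval.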

Note that in the current architecture the conference application can still employ its 
own mechanisms for confidentiality and authentication at the Application layer. 
This approach may be preferable in certain circumstances, where the security 
requirements are more stringent and where user-to-user security must be 
established (eg. using smartcards). 

In the current architecture, the group key at the host level only affords group
authentication. That is, other member-hosts can be assured implicitly only that the
data originated from a valid group member (unless additional means are
employed). If sender authentication at the Network layer is also required in order
to identify the source-host, then each member-host must embed its signature as part
of the payload. This is because IPSEC does not provide explicit means to include
signatures for authentication for each data packet. Digital signatures can be
employed, while less explicit signing mechanisms are also available.



4 REMARKS AND CONCLUSION 

In this paper we have argued that from a security perspective there is much to be 
gained by employing a “secured” IP multicast at the Network layer to support the 
formation and management of secure conferences at the Application layer. A
secured IP multicast, with group authentication and confidentiality, already
achieves a reasonable level of security, and therefore fulfils a large part of the basic
requirements of secure conferencing.

The current work has viewed and discussed group-oriented security from two 
perspectives, namely secure conferencing at the Application layer (via conference 
key generation and distribution systems) and group key management protocols at 
the Network layer. It has also attempted to classify multicasts and conferences in 
order to find similarities and differences in behaviour at the two layers. 

Following from this, the Multicast/Conferencing Security Architecture (MCSA)
was proposed. It identifies components at the Session layer that
act collectively as an intermediary between the conference application
program at the Application layer and the multicast-related protocols (host-side or
client-side) at the Network layer. Part of the internal task of the MCSA is to map
instances of conferences to those of multicasts in a secure fashion with the aid of an
Authorizer entity. The interaction of the relevant parties in the multicast and
conference has also been briefly outlined.

There are still a number of open problems in the area of group-oriented security, 
particularly with respect to practical GKM protocols. These include the 
introduction of a group security association (GSA) for IP multicast in the spirit of 
IPSEC, the design of a GKM protocol for inter-domain key distribution suitable for 
the various IP-multicast applications, and others. These, as well as other issues 
specific to the MCSA, will be the directions for future work.



Abstract

With the advent of multimedia applications, the support of on-line multicasting
with quality of service (QoS) guarantees has gained considerable attention
in the field of communication networks and distributed systems. The objective
of this paper is to investigate on-line QoS-based routing and path establishment
schemes to support point-to-multipoint connections in wide area networks.
We propose SELDOM, a Simple and Efficient Low-cost Delay-bounded
Online Multicasting scheme to support on-line multicasting. The scheme is
particularly tailored to networks in which group membership changes frequently.

The approach taken by the scheme is unique in the sense that, given a 
set of QoS requirement specifications of each multicast node and the current 
status of the network links, SELDOM finds a minimum cost multicast tree 
that meets these QoS specifications of the supported group members. The 
scheme handles join requests dynamically by determining the least cost path 
which satisfies the required delay bounds to which the new group member 
is to be attached. On the other hand, to handle a leave request, the scheme 
seeks to limit the rearrangement required in order to reduce the disturbance 
such a request may cause to current members of the group. The worst-case time
complexity of SELDOM is O(n^2).

Keywords 

online multicast routing, Steiner tree, quality of service



1 INTRODUCTION

Recent advances in high-speed networking technology have created opportunities
for the development of a wide spectrum of sophisticated multimedia
applications which generate, integrate, process, store, and distribute time
dependent and time independent media. Typical applications include video
conferencing, computer supported collaborative work, and limited video
broadcasting. These applications are characterized by a wide spectrum of QoS
requirements and the need for group communications among multiple end-points.

Support of QoS-based group communication in multimedia environments
requires the development of efficient and cost-effective multicast algorithms.
The ability to perform such a task is becoming a major requirement for computer
networks that support multimedia applications. To increase the fraction
of accepted multicast sessions, the network must use the minimum amount of
network resources while guaranteeing the sessions’ QoS requirements. From
the routing point of view, an efficient multicast algorithm must only replicate
packets when necessary, namely at the branching points of the tree.

In the past, the bandwidth required by applications was small and the
applications’ QoS requirements were not as stringent as those of current
multimedia applications. Hence, simple multicast algorithms were used to manage
the network resources. However, with the advent of multimedia applications,
developing efficient multicast algorithms is becoming increasingly important.
To increase the rate of accepted multicast sessions, new algorithms that minimize
the amount of replicated traffic exchanged during multimedia multicast
sessions must be developed. These algorithms must guarantee the stringent
QoS requirements of multicast sessions while minimizing the cost of the
multicast trees used to exchange the resulting traffic.

The multicast group set can be known before setting up the multicast routing
tree. In this case, the problem is called the off-line multicasting problem.
What is required in this case is an algorithm that, given a set of QoS requirement
specifications, the current status of the network links, and the multicast set,
finds a Minimum Cost Multicast Tree (MCMT) that meets the QoS specification
of the multicast tree nodes.

Multicast applications such as teleconferencing, distance learning, collaborative
work, and data distribution may require the ability to support dynamic
sessions. Dynamic sessions are characterized by their members’ ability to join
or leave in a dynamic fashion. The sudden and unexpected arrival and departure
of session members makes the multicast problem an on-line problem
where routing decisions have to be made on-line while the multicast session
is in progress. Hence, the multicast group set is not known prior to setting
up the multicast session. This problem is called the Online Minimum Cost
Multicast Tree (OMCMT) problem.

The objective of the multicast problem in a multimedia network is to build
a low cost tree that bounds the source-destination delay. That is, given a
graph G = (V, E), where each link is assigned a cost and a delay, a source node s,
and a multicast set D, find the lowest cost multicast tree that bounds the
source-destination delay. This problem is NP-complete. That can be easily shown
by setting the delay bound to infinity, which reduces the problem to the
Steiner tree problem. The Steiner tree problem is a well known NP-complete
problem [19]. Several exact solutions to the Steiner tree problem have been
proposed in [8, 19]. All proposed exact solutions, however, require exponential
execution time. This prompted the development of several polynomial heuristics
for approximate solutions. A survey of these heuristics, as well as exact
algorithms, for Steiner problems in networks is provided in [19].
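
The problem statement above can be written a little more formally. The symbols c(e) for link cost, d(e) for link delay, \Delta for the delay bound and P_T(s, v) for the path from s to v in a tree T are notation introduced here purely for illustration; they are not taken from the paper.

```latex
\min_{\substack{T \subseteq G \ \text{a tree} \\ \{s\} \cup D \subseteq V(T)}}
  \; \sum_{e \in T} c(e)
\qquad \text{subject to} \qquad
\sum_{e \in P_T(s,v)} d(e) \;\le\; \Delta
\quad \text{for all } v \in D .
```

With \Delta taken to be infinite the constraint is always satisfied, and what remains is the minimum cost tree spanning \{s\} \cup D, that is, the Steiner tree problem.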

The rest of the paper is organized as follows: we start by reviewing some of
the approaches and algorithms proposed to provide an approximate solution to
the off-line, low-cost, delay-bounded multicast tree problem. After that, a
discussion of low-cost online multicasting algorithms, which update and maintain
multicast trees dynamically in response to join or leave requests, is presented.
Section 3 introduces SELDOM, a newly proposed algorithm for the online,
low-cost, delay-bounded multicast problem. A conclusion of this work is
presented in the last section.



2 RELATED WORK 

Based on their design objectives, the multicast algorithms proposed in the 
literature can be viewed as members of one of three possible classes. The first 
class includes algorithms which are designed to accommodate an Internet-based
environment. The second class includes off-line algorithms which aim
at reducing the cost of the multicast tree, while bounding the end-to-end 
delay. The third class includes algorithms which deal with on-line multicasting. 
These algorithms are reviewed next. 



2.1 IP-based Multicast Protocols 

The Internet community proposed different algorithms to create multicast
trees, including the Distance Vector Multicast Routing Protocol (DVMRP),
Multicast Open Shortest Path First (MOSPF), Protocol Independent Multicasting
(PIM), and Core Based Trees (CBT) [13, 14, 7, 3]. DVMRP is built on
top of RIP (Routing Information Protocol), a distance vector protocol, which
is slow to detect loops and link failures. MOSPF uses Open
Shortest Path First (OSPF) to maintain a current image of the network topology.
CBT builds a single distribution tree, formed around a focal router which
is called the core. The major drawback of CBT is the concentration of traffic
at the core of the tree. Hence, CBT is vulnerable to core failure, which can
partition the tree. Protocol Independent Multicast (PIM) addresses both
dense (PIM-DM) and sparse (PIM-SM) environments [7]. PIM-DM is envisioned
to be used in an area where group membership is dense. PIM-DM is
similar to DVMRP except that the unicast routing information is imported from
the existing unicast protocol rather than being computed by a unicast routing
mechanism built into the multicast protocol. PIM-SM creates a center in the tree
which is called a Rendezvous Point (RP). Each multicast group has a default router
as its RP. New receivers join the tree through the RP. A receiver can switch
from the shared tree to a source-based tree. Upon switching, the source prunes
itself from the shared tree.

The above Internet multicast algorithms are designed to work specifically 
with the current IP environment and to take advantage of the IP routing 
protocols such as RIP and OSPF. The design objectives of these algorithms 
focused on issues related to scalability and reduced communication overhead, 
but did not address the QoS requirements of multimedia applications. 



2.2 Off-line, Low-cost, Delay-Bounded Multicast Tree Problem

In addition to low cost, multimedia applications have different demands in 
terms of bandwidth, reliability, delay and jitter. A key property of multimedia 
data is its time dependency. The support of sustained streams of multimedia 
objects, over a period of time, requires the establishment of reliable, low delay
and low cost source-to-destination routes. Nevertheless, the objective is not
to develop a strategy which produces the lowest possible end-to-end delay, 
but a strategy to ensure that the data traffic arrives within its delay bound, 
thereby allowing a tradeoff between delay and cost. Thus, the objective is to 
produce a tree that has minimal cost among all possible trees that bound 
end-to-end traffic delay between all source-destination pairs. 

Many heuristics were developed for the low-cost unbounded-delay multicast
problem [20, 2]. However, there have been few attempts to develop low-cost
bounded-delay multicast heuristics. In the following, we review off-line, low-cost,
delay-bounded multicast heuristics. A simple approach to solving this problem is
to use a tree that is composed of the least delay paths from the source to the
multicast nodes. Such an approach will always find a solution that conforms
to the delay bounds if one exists. This approach, however, does not take into
consideration any cost optimization. A different heuristic for solving the
delay-constrained multicast tree problem is to use the constrained shortest path
tree [15]. This heuristic first builds a tree composed of the least cost paths to the
destinations. If the end-to-end delay to any group member is violated, the
least delay path is used for that member instead.
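
The per-destination path choice behind the constrained shortest path idea can be sketched in Python as follows. This is a simplified illustration, not the algorithm of [15]: it returns one path per destination and omits the step of merging the paths into a single loop-free tree. The graph encoding, the function names and the example network are all assumptions.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# graph[u] = list of (v, cost, delay); an undirected link appears in both directions.
Graph = Dict[str, List[Tuple[str, float, float]]]


def best_path(graph: Graph, src: str, dst: str, weight: int) -> Optional[List[str]]:
    """Dijkstra on one metric: weight=0 minimises cost, weight=1 minimises delay."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, src, [src])]
    settled = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, cost, delay in graph.get(node, []):
            if neighbour not in settled:
                step = cost if weight == 0 else delay
                heapq.heappush(queue, (dist + step, neighbour, path + [neighbour]))
    return None


def path_delay(graph: Graph, path: List[str]) -> float:
    """Sum the link delays along a path (the first matching link is used)."""
    return sum(next(d for n, _, d in graph[u] if n == v)
               for u, v in zip(path, path[1:]))


def choose_paths(graph: Graph, src: str, dests: List[str],
                 bound: float) -> Dict[str, Optional[List[str]]]:
    """Least cost path if it meets the bound, else least delay path, else None."""
    chosen: Dict[str, Optional[List[str]]] = {}
    for dst in dests:
        path = best_path(graph, src, dst, weight=0)
        if path is None or path_delay(graph, path) > bound:
            path = best_path(graph, src, dst, weight=1)
            if path is not None and path_delay(graph, path) > bound:
                path = None  # even the fastest route violates the bound
        chosen[dst] = path
    return chosen


# Hypothetical network: the cheap route s-a-d is slow, the direct link s-d is fast.
example: Graph = {
    "s": [("a", 1, 5), ("d", 10, 2), ("b", 1, 1)],
    "a": [("s", 1, 5), ("d", 1, 5)],
    "d": [("a", 1, 5), ("s", 10, 2)],
    "b": [("s", 1, 1)],
}
paths = choose_paths(example, "s", ["d", "b"], bound=6)
```

In the example, destination d falls back to the direct but expensive link because the cheaper two-hop route exceeds the delay bound, while destination b keeps its least cost path.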

A dynamic programming approach was suggested by Kompella, Pasquale,
and Polyzos [12]. This heuristic assumes that link delays are represented by
integer values. The heuristic begins by finding the least cost delay-bounded path
from each multicast node to every other. Then, it uses the minimum spanning
tree algorithm to connect the multicast nodes without violating the end-to-end
delays. The complexity of this approach depends on the granularity of the
delay values. If the granularity of the delay is very fine then the complexity
will be large.

An iterative optimization approach to the minimum delay tree was suggested
in [21]. The algorithm starts with the minimum delay tree. Then, it
replaces the relay paths with lower cost paths without violating the delay 
bound. It continues until the cost of the tree cannot be reduced further. 

A simple heuristic based on the Simple Path Heuristic (SPH) [16] was proposed
in [1]. The heuristic, Least Cost First (LCF), decouples the cost optimization
from bounding the delay by building a low cost tree incrementally
and then checking the delay bound requirements. The node with the least cost
path to the tree is selected. If the path to that node violates the delay bound,
the least-cost delay-bounded path out of the possible low cost paths from the
tree to that node is used instead. This process continues iteratively until all
multicast nodes are included. Failure to include all multicast nodes implies
that no multicast tree which satisfies the QoS requirements for all multicast
nodes exists. The analysis shows that the performance and complexity of the LCF
heuristic are comparable to those of the SPH approximation if the number of
delay violations remains moderately small.

Most of the above algorithms were designed for undirected networks; however,
they can also be implemented in directed networks. In addition, they are
designed only for static (off-line) multicast trees. In the rest of this
paper we discuss online multicast trees. All of the online heuristics that we
are aware of address low-cost online multicasting with no consideration of
delay bounds. In the following, we discuss some of these online multicasting
heuristics.



2.3 Online Heuristic Algorithms 

The multicast group membership in typical multimedia settings, such as on- 
line video conferencing or multimedia group authoring, dynamically changes 
as new members request to join the group or current members request to 
leave the group. Therefore, supporting dynamic multicast applications effi- 
ciently requires adding or deleting members to the multicast tree efficiently 
and transparently to the other multicast members. While many research works 
have addressed static multicast group communications in WANs, very little 
research has considered the dynamic version of the multicast communication 
problem. 

An intuitive and trivial solution to this problem is to rebuild the tree using a 
static algorithm whenever a join request by a new member, or a leave request 
by a current member, must be handled by the network group management 
protocol. However, such a solution may have repercussions for members who 
remain in the group since there may be a disturbance in the communication. 
Furthermore, such a change may cause packets to arrive out of order. Another
solution is to permit only local or partial reconfigurations when modifications
to the group membership are required. Yet another approach is to start with
an optimal tree and make minimal changes as group membership changes 
without causing disruption to the members who remain in the group. This 
approach, however, may not be as efficient as the other approaches because 
no reconfiguration of the current tree is allowed. 

Waxman was one of the first researchers to address the online multicast tree
problem [17, 9]. In his work, Waxman partitioned the on-line multicast
heuristics into two types: those that do not allow rearrangement of the
multicast tree (nonrearrangeable) and those that allow rearrangement
(rearrangeable) when the cost exceeds some limit. Several heuristics which
approximate the OMCMT problem have been suggested [17, 4, 6, 10, 11, 18, 9],
but none of these addresses the delay requirements of multimedia applications.
Following is a review of some of these heuristics.



(a) On-Line Greedy Heuristic (OGH) 

This heuristic works as follows. In response to a join request, a node is added 
to the multicast tree using the shortest path from the current multicast tree to 
that node. For each leave request, the node is marked as a non-multicast node 
and is deleted only if it is a leaf node. This is achieved by removing the leaf 
node from the multicast tree and all branches linking that node to the tree 
[17]. Imase and Waxman proved that in the case where only node additions 
are allowed, the worst case cost scenario of the multicast tree produced by 
OGH is no worse than twice the cost of the multicast tree produced by the 
best nonrearrangeable algorithm for the online Steiner tree [9].
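
The join and leave rules of OGH can be sketched as follows. This is a minimal
illustration rather than the code of [17]: the names `ogh_join` and
`ogh_leave`, the undirected layout `graph[u][v] = cost`, and the tree stored
as a map from node to neighbour set are all assumptions of this sketch. Link
costs are assumed to be positive.

```python
import heapq

def ogh_join(graph, tree, members, node):
    """Attach `node` using the cheapest path from any current tree node."""
    dist = {u: 0 for u in tree}          # every tree node is a search source
    prev = {}
    heap = [(0, u) for u in tree]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                     # stale heap entry
        if u == node:
            break
        for v, cost in graph[u].items():
            if d + cost < dist.get(v, float("inf")):
                dist[v], prev[v] = d + cost, u
                heapq.heappush(heap, (d + cost, v))
    if node not in dist:
        raise ValueError("node is unreachable from the tree")
    members.add(node)
    cur = node
    while cur in prev:                   # walk back until the tree is reached
        parent = prev[cur]
        tree.setdefault(cur, set()).add(parent)
        tree.setdefault(parent, set()).add(cur)
        cur = parent

def ogh_leave(tree, members, node, source):
    """Mark `node` as a non-member and prune it only if it is a leaf."""
    members.discard(node)
    cur = node
    while cur != source and cur not in members and len(tree[cur]) == 1:
        (parent,) = tree.pop(cur)        # the single neighbour of a leaf
        tree[parent].discard(cur)
        cur = parent                     # the branch may now end in a leaf
```

Note that an interior node that leaves stays in the tree as a relay until the
members behind it leave as well.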



(b) Edge Bounded Algorithm 

Edge Bounded Algorithm (EBA) is a rearrangeable algorithm in which a 
partial rearrangement is permitted when a modification to the membership 
occurs. EBA bounds the worst case performance of the generated tree to 4a
times that of an optimal Steiner tree, where a is a constant value [9]. Also,
it limits the number of rearrangements to $O(K^{3/2})$, where K is the number
of (join, leave) requests served.

The algorithm works by creating a distance graph G' and a tree T'. G' is a
graph derived from the original graph G. The nodes of G' are those of G. The
edges of G', however, are built such that G' is a complete graph. Furthermore,
the weights of the edges of G' are the costs of the minimum cost paths between
the nodes of G. A multicast tree T' is created for the node set Z
($Z = \{s\} \cup D$) from G' by pruning the minimum spanning tree of G'. For
each join request, EBA selects the least cost path from the new node v to the
closest node u in T'. EBA verifies that the added path is a-bounded by
ensuring that the cost of the maximum cost edge on the path between v and any
node u in T' does not exceed a times the cost of edge (v, u) in G'. If a path
to a node u exceeds that limit, v and u will be connected using the least cost
path.
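
The distance graph G' used by EBA can be built with any all-pairs shortest
path routine. The sketch below uses the Floyd-Warshall algorithm, which is a
choice made for this example and not something prescribed in [9]; the function
name and the `(u, v) -> cost` input layout are likewise hypothetical.

```python
def distance_graph(nodes, cost):
    """Return d[(u, v)] = cost of the cheapest path between u and v.

    `cost` maps an undirected link (u, v) to its cost. The result is the
    weight function of a complete graph on `nodes`.
    """
    inf = float("inf")
    d = {(u, v): (0 if u == v else inf) for u in nodes for v in nodes}
    for (u, v), c in cost.items():
        d[u, v] = d[v, u] = min(d[u, v], c)
    for k in nodes:                      # allow k as an intermediate node
        for i in nodes:
            for j in nodes:
                if d[i, k] + d[k, j] < d[i, j]:
                    d[i, j] = d[i, k] + d[k, j]
    return d
```

A minimum spanning tree computed on these weights, restricted to the nodes of
Z, then gives the starting tree T' described above.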

Based on this algorithm, a delete request issued by a node v is handled 
in a way that depends on the degree of v. If the degree of v is one, v is 
removed using a procedure similar to the one described in OGH. If the node has
degree three or more, the node will be marked as deleted and no action will 
be taken. If the degree is two then the node will be removed along with the 
two adjacent edges. That will create two subtrees. These two subtrees will be 
connected using an edge that minimizes the cost of the path between the two 
subtrees. 

After serving any join or leave request, EBA verifies that T' is still an 
extension tree. An extension tree is a tree in which the degree of any non- 
multicast member is greater than two. If the degree of a non-multicast node is 
two, the same process used to remove a node whose degree is two is undertaken 
to build an extension tree. 

(c) Shortest Path Tree 

Doar and Leslie suggested adding a node by using the shortest path from 
the source to that node [6]. Thus, the multicast tree will be the union of the 
minimum source-to-destination shortest paths. Such a tree will give the same 
result whether the tree was built dynamically or statically. As a result, this 
approach makes the process of building multicast trees less prone to major 
spikes of inefficiency. Furthermore, the algorithm does not require handling of 
rearrangements when nodes join or leave the session. Doar and Leslie showed 
that such a tree is on average more than 60% worse than the optimal multicast 
tree. They also suggested imposing a hierarchical model which emulates a real
network architecture composed of major backbones and subnetworks. With
such a model they showed, by simulation, that the resulting trees are on
average less costly than trees produced by a non-hierarchical model because there
is more sharing of the backbone links. 

(d) The Geographic-Spread Dynamic Multicast Heuristic (GSDM)

GSDM is a rearrangement heuristic which was proposed by Kadirire [11]. It 
is an optimization of the OGH. To illustrate this process, assume that a given 
node A issued a request to join the multicast tree. Furthermore, assume that 
node B is the closest tree node to node A, and nodes C and D are the two 
closest nodes to B . Based on this configuration, the heuristic selects the least 
cost path among C-B-D & B-A, C-B-A-D, and C-A-B-D. If more than one 
minimum cost path exists, the heuristic selects the path that maximizes the 
Geographic Spread (GS) of the resulting tree [10].

The GS is defined as follows: let T be a tree that spans Z
($Z = \{s\} \cup D$), where $Z \subset V$, and let $v \in V$ and $z \in Z$.
The GS is defined as the inverse of the sum of the minimum distance from v to
all $z \in Z$ for all nodes $v \in V$, as shown in equation 1. It has been
shown that the GSDM heuristic usually performs slightly better than OGH [11].

$$GS(T) = \frac{1}{\sum_{v \in V,\ z \in T} SP(v, z)} \qquad (1)$$

where SP(v,z) is the shortest path between v and z.
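
As a small worked illustration of equation 1, the function below evaluates GS
from a precomputed table of shortest path lengths. It follows one reading of
the equation, namely a sum over all pairs (v, z) with v in V and z in T; the
function name and the nested-dictionary layout `dist[v][z]` are assumptions of
this sketch.

```python
def geographic_spread(dist, all_nodes, tree_nodes):
    """Inverse of the summed shortest path lengths SP(v, z), v in V, z in T."""
    total = sum(dist[v][z] for v in all_nodes for z in tree_nodes)
    if total == 0:
        raise ValueError("GS is undefined when every distance is zero")
    return 1.0 / total
```

For three nodes on a line with unit spacing, a tree on the first two nodes
gives a sum of 5 and therefore GS = 0.2.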

(e) ARIES 

ARIES (A Rearrangeable Inexpensive Edge-Based On-Line Steiner Algorithm) 
is a rearrangeable heuristic [4]. ARIES performs a rearrangement of a region 
of the multicast tree when the number of modifications (join, leave) within 
that region reaches a threshold. A region is defined as the part of the multicast 
tree whose interior nodes are non-stable nodes. A stable node is a node that 
has never been modified since the start of the tree or the last rearrangement 
of that region. The performance and the time complexity of ARIES algorithm 
depend on the threshold value. Small threshold values improve ARIES’s per- 
formance and increase its run time, and vice versa. 



3 HEURISTIC SELDOM 

The objective of the OMCMT problem is twofold: minimizing the multicast
tree cost and bounding the delays from the source to the destinations. Mini- 
mizing the cost by itself is a harder problem since the problem reduces to the 
Steiner tree problem which is inherently NP-complete. On the other hand, 
finding a multicast tree which satisfies the specified delay requirements can 
be solved in polynomial time using one of the classical minimum cost path 
algorithms [5]. When the two parameters are combined together, the problem 
is still NP-complete, even in the static case. 

On-line multicast algorithms must handle dynamic group membership re- 
quests to join or leave a multicast session in progress. These requests require 
updating the multicast tree by adding the joining node or removing the leav- 
ing node from the tree in a way such that the overall cost of the tree remains 
minimal and the delay bounds of all multicast nodes are still satisfied. This 
online version of the multicast problem is still NP-complete in both the
rearrangeable and the nonrearrangeable setting. Therefore, a heuristic that
gives a “good” approximation with low run time overhead is needed.
Furthermore, the heuristic must avoid disrupting the current connections and
cause a minimal amount of change to them. More specifically, in the case of a
join request, the objective of an online multicast algorithm is to find a
low-cost, delay-bounded path to the new joining node, while keeping the number
of rearrangements as low as possible. Furthermore, in the case of a leave
request, the algorithm should delete the leaving node in such a way that the
delay bounds are not violated and the number of rearrangements remains low.
In the following, we first formulate the online multicast problem. We then
propose a new online heuristic, referred to as Simple and Efficient, Low-Cost,
Delay-Bounded Online Multicasting (SELDOM), which provides an effective
solution to handle join and leave requests efficiently.



3.1 Problem Definition

A point-to-point communication network can be represented as a directed
graph G = (V, E), where V denotes a set of nodes and E a set of asymmetric
links. The network is assumed to be full duplex. In other words, the existence
of link $l = (u, v) \in E$ implies the existence of link $l' = (v, u) \in E$.
Each link $l \in E$ is assigned a cost value $C : E \to \mathbb{R}^+$. A link
cost value can be either the link utilization or a monetary value associated
with the link. Also, each link $l \in E$ has delay $\delta(l)$, where
$\delta(l) \in \mathbb{R}^+$. A link delay may consist of CPU processing,
queuing, transmission and propagation delays. Because network links are often
asymmetric, it is often the case that $C(u, v) \ne C(v, u)$ and
$\delta(u, v) \ne \delta(v, u)$. If the links are symmetric, then
$C(u, v) = C(v, u)$ and $\delta(u, v) = \delta(v, u)$.

A path from node $v_1$ to $v_k$ is defined as the sequence of nodes and links
$P(v_1, v_k) = v_1, v_2, \ldots, v_k$ such that $(v_i, v_{i+1}) \in E$ for all
nodes from $v_1$ to $v_{k-1}$. The path $P(v_1, v_k)$ is assumed to be loop
free. The cost of a path is the sum of the cost of the links constituting
$P(v_1, v_k)$:

$$Cost(P(v_1, v_k)) = \sum_{l \in P(v_1, v_k)} C(l) \qquad (2)$$

Similarly, the total delay of a path is the sum of the delay of the links
constituting $P(v_1, v_k)$:

$$Delay(P(v_1, v_k)) = \sum_{l \in P(v_1, v_k)} \delta(l) \qquad (3)$$
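
Equations 2 and 3 translate directly into code. In the sketch below the
directed links are stored in a dictionary keyed by `(u, v)` with a
`(cost, delay)` pair as value; this layout and the function names are
assumptions made for illustration.

```python
def path_links(path):
    """Links (v_i, v_{i+1}) of a path given as a node sequence."""
    return list(zip(path, path[1:]))

def path_cost(links, path):
    """Equation 2: sum of the link costs along the path."""
    return sum(links[l][0] for l in path_links(path))

def path_delay(links, path):
    """Equation 3: sum of the link delays along the path."""
    return sum(links[l][1] for l in path_links(path))
```

Because the dictionary is keyed by ordered pairs, the two directions of a
full duplex link can carry different costs and delays, as allowed by the
definition above.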



The general MCMT problem can be defined as follows: let the multicast
group consist of a source node s and a destination node set D. Given that the
maximum delay allowed on the path P(s, d), where $d \in D$, is $\Delta_d$, the
MCMT, T, is the tree that satisfies the following two conditions:

$$Cost(T) = \sum_{l \in T} C(l) \ \text{is minimum} \qquad (4)$$

subject to

$$Delay(P(s, d)) = \sum_{l \in P(s, d)} \delta(l) < \Delta_d \quad \forall d \in D \qquad (5)$$
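
The two conditions can be checked for a given candidate tree as follows. The
sketch evaluates the cost of equation 4 and tests the constraint of equation
5; it does not search for the minimum. The tree is stored as a child-to-parent
map, and this layout, the function names, and the link values used in the
usage note are hypothetical.

```python
def tree_cost(graph, parent):
    """Equation 4: total cost of the tree links (parent -> child)."""
    return sum(graph[p][c][0] for c, p in parent.items())

def delay_from_source(graph, parent, node):
    """Delay of the tree path from the source down to `node`."""
    total = 0
    while node in parent:                # the source has no parent entry
        p = parent[node]
        total += graph[p][node][1]
        node = p
    return total

def is_feasible(graph, parent, bounds):
    """Equation 5: every destination meets its own delay bound."""
    return all(delay_from_source(graph, parent, d) < b
               for d, b in bounds.items())
```

For a tree 1 -> 2 -> 4 whose two links each have cost 1 and delay 2, the tree
cost is 2, the delay to node 4 is 4, and a bound of 7 for node 4 is satisfied.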

In some cases, the underlying multicast application is delay insensitive. In
this case, the delay is not bounded ($Delay(P(s, d)) = \infty$,
$\forall d \in D$), and the MCMT problem reduces to minimizing the total cost
of the multicast tree. Furthermore, if the links are undirected, the MCMT is
reduced to finding the minimum cost tree, T, that contains all nodes
$\{s\} \cup D = Z$. This problem is well known in the literature and is called
the Steiner Minimal Tree (SMT) problem [8].

The above definition applies to the case where the multicast nodes are
known a priori (i.e. off-line multicasting). However, when the problem is
online, the multicast tree is dynamic; the multicast nodes in this case may
join or leave dynamically. This problem is called the Online Minimum Cost
Multicast Tree (OMCMT) problem. In this case, a request vector
$R = (r_1, r_2, \ldots, r_k)$ with k requests is given, where $r_i$ has three
parameters $(v, x, \Delta_v)$ such that $v \in V$, $x \in \{add, remove\}$,
and $\Delta_v$ is the delay bound to that node. The OMCMT tree in this case is
defined as the tree that satisfies the two conditions in equations 4 and 5
after processing each join or leave request.
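
One possible in-memory form of the request vector R is sketched below. The
`Request` type and the `replay` helper are hypothetical names introduced for
this example; the helper only tracks which nodes are members after each
request and does not build the tree.

```python
from typing import NamedTuple

class Request(NamedTuple):
    node: int       # v, the node issuing the request
    op: str         # x, either "add" or "remove"
    bound: float    # delay bound for that node

def replay(requests):
    """Yield the member -> delay bound map after each request in turn."""
    members = {}
    for r in requests:
        if r.op == "add":
            members[r.node] = r.bound
        elif r.op == "remove":
            members.pop(r.node, None)
        else:
            raise ValueError(f"unknown operation: {r.op!r}")
        yield dict(members)
```

An online heuristic would have to restore the conditions of equations 4 and 5
at each of the states produced by this loop.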



3.2 SELDOM Design Approach 

Given that some of the networks are not always congested and the number of 
delay violations may be low, the approach taken by SELDOM to efficiently 
handle online multicast requests is to first minimize the cost of the paths to 
destinations, and second verify the feasibility of these paths in supporting the 
required delay bounds. This approach is being taken since the cost reduction 
is inherently an NP-complete problem while bounding the delay is a simpler 
problem which can be performed in polynomial time. 

In response to a join request, SELDOM determines the least cost path from 
the multicast tree to the new node, and verifies the feasibility of the selected 
path in meeting the delay bound requirements of the joining node. If the 
delay requirements of the newly added node are violated, SELDOM searches
for a delay-bounded path with the lowest cost. Following is a description of 
two modes of SELDOM operations, namely nonrearrangeable SELDOM, and 
rearrangeable SELDOM. 



3.3 Nonrearrangeable SELDOM 

In a connection oriented network, it is important to reduce the number of 
rearrangements of current connections as a multicast tree evolves. This is 
especially important when the network is loaded and the network control
and routing information is distributed. Rearrangement involves rerouting of 
information and may require a significant amount of network synchronization 
and resources. Therefore, a nonrearrangeable online multicast algorithm is 
desirable when rearrangement of the multicast tree is difficult. 

First, the SELDOM nonrearrangeable mode is presented. In this mode, 
SELDOM does not require any rearrangement of the multicast connections. 
The nonrearrangeable mode of SELDOM works as follows: for each incoming 
request, if it is a leave request, the node is marked as a non-multicast node 
and it is deleted only if it is a leaf node. The deletion is achieved by removing 
the leaf node from the multicast tree and all branches linking that node to 
the tree. If it is a join request then the following algorithm is used: 

Let G be the network graph, T be the current multicast tree, and v be the 
node to be added. 

1. Find the least cost path SP(T,v) from T to v. If SP(T,v) total delay is 
bounded then return that path and quit. Otherwise, perform the following 
steps. 

2. Create a new graph G' which consists of G’s nodes and G’s edges reversed.

3. Let P be the set of the shortest delay paths from v to T’s nodes in G'.

4. Remove the edges of T from G'.

5. Find the set of the least cost paths from v to T’s nodes in G' and add them
to P.

6. Out of P, pick the path p which satisfies the delay bound such that
$delay(p) < \Delta_v$ and cost(p) is minimum.

7. If no such path exists then return “cannot add this node”. 

8. Return p. 
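
The eight steps above can be sketched as follows. This is a minimal sketch
under stated assumptions and not the authors' implementation: the graph layout
`graph[u][v] = (cost, delay)`, the `tree_delay` map (tree node to its delay
from the source), the helper names, the use of a heap-based Dijkstra search,
and the decision to charge a candidate the cost of all of its links are
choices made for this example. The end-to-end delay of a candidate is taken as
the tree delay of its attachment node plus the delay of the new path.

```python
import heapq

def dijkstra(adj, start, metric):
    """Predecessor map of shortest paths from start (0 = cost, 1 = delay)."""
    dist, prev, heap = {start: 0}, {}, [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                                  # stale heap entry
        for v, w in adj.get(u, {}).items():
            if d + w[metric] < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w[metric], u
                heapq.heappush(heap, (d + w[metric], v))
    return prev

def walk(prev, start, end):
    """Path start..end in the searched graph, or None if unreachable."""
    if end != start and end not in prev:
        return None
    path = [end]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

def seldom_join(graph, tree_edges, tree_delay, v, bound):
    """Return the path (tree node, ..., v) chosen for v, or None."""
    rev = {}
    for a, nbrs in graph.items():                     # step 2: reverse G
        for b, w in nbrs.items():
            rev.setdefault(b, {})[a] = w

    def measure(p, i):                                # p runs v -> tree in G'
        return sum(graph[b][a][i] for a, b in zip(p, p[1:]))

    def candidates(adj, metric):
        prev = dijkstra(adj, v, metric)
        paths = (walk(prev, v, u) for u in tree_delay)
        return [p for p in paths if p]

    def feasible(p):
        return tree_delay[p[-1]] + measure(p, 1) < bound

    first = min(candidates(rev, 0), key=lambda p: measure(p, 0), default=None)
    if first and feasible(first):                     # step 1
        return first[::-1]
    P = candidates(rev, 1)                            # step 3
    for a, b in tree_edges:                           # step 4
        rev.get(b, {}).pop(a, None)
    P += candidates(rev, 0)                           # step 5
    ok = [p for p in P if feasible(p)]                # step 6
    if not ok:
        return None                                   # step 7
    return min(ok, key=lambda p: measure(p, 0))[::-1]  # step 8
```

Running both searches from v in the reversed graph yields the paths to every
tree node with one search per metric, which is the speed-up that the reversal
is meant to provide.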



G’s links are reversed to speed up the shortest path calculations. The edges
of T are removed to create independent paths. To explain the above idea
further and to show the benefit of removing T’s links, we give the following
example. In Figure 1 the source node is node 1, the current multicast tree
includes multicast node 4 (in addition to the source node 1), and the maximum
delay bound for all multicast nodes is 7. Initially the multicast tree
consists of nodes 1, 2, and 4 with links (1,2) and (2,4). The cost of the tree
is 2 and the delay to node 4 is 4. Assume that a new request comes to add node
5 to the multicast tree. SELDOM will first try adding node 5 using the least
cost path from T to 5, which is path (4,5). However, the delay on that path
is 8, which violates the delay bound. Therefore, SELDOM will try to find a
better delay-bounded path. First, SELDOM will create a new graph G' which
is similar to G but with the links reversed, as shown in Figure 2. The
reversal of the links is performed to speed up the shortest path computation
from the violating node to the tree. Then, SELDOM will compute the least delay
paths in G from the T nodes to node 5. This can be performed in $O(n^2)$ using
the shortest delay paths in G'. These paths are: (5,2,1), (5,2), (5,4). After
that, the T links ((1,2), (2,4)) are removed from G'. The least cost paths
from node 5 to the T nodes will be computed. These paths are: (5,3,1),
(5,3,2), and (5,4). Out of these six paths, path (5,3,2) gives the least-cost
bounded path with an additional cost of 4 and a total delay (from the source
to node 5) of 4. If the T links were not removed, the least cost paths would
be (5,4,2,1), (5,4,2) and (5,4). These paths are not independent because they
use T’s links (1,2) and (2,4). Hence, without removing the T links, path
(5,3,2) would not be discovered.

Figure 3 SELDOM example with a source node and two destinations, 2 and
6; each link is assigned two values (cost and delay).



3.4 Rearrangeable SELDOM

Handling join and leave requests in rearrangeable networks is more involved 
than in nonrearrangeable settings. In the following, we first describe the op- 
erations SELDOM undertakes to respond to a join request. We then describe 
the operations required to handle a leave request. 

(a) Join Request 

When adding a node to the current multicast tree in response to a join request, 
the nonrearrangeable SELDOM may produce a multicast graph which has a 
node with two incoming paths. As an example, consider the graph depicted 
in Figure 3. In this graph, the multicast set consists of the source node and 
two destination nodes, 2 and 6, with a delay bound of 10. The multicast tree 
in this case is marked using bold links with total cost of 9. Assume that node 
5 wants to join the multicast set. In this case, node 5 can be added using 
the nonrearrangeable SELDOM heuristic. Path 2,5 cannot be used because
its traffic will come through link S,2 with a path delay of 11, which violates
the delay bound. In this case, node 5 will be added to the multicast tree using
path 1,2,5 as shown in Figure 4 with a total tree cost of 17. This new path
makes node 2 have two incoming paths. The generated multicast tree satisfies
the delay requirements but this configuration makes node 2 do double work
for the same packets.

Figure 4 SELDOM example after adding node 5.

Figure 5 SELDOM example after pruning the link from the source to node 2.

A multicast tree with a node which has two incoming paths will deliver 
the multicast traffic as it should. However, that will increase routers’ state 
information. In the above case, a router will have to remember what to do, 
although it handles the same packets. Also, that will double the workload of 
the router for the same packets because the router is forced to process the 
same packets twice. Furthermore, it will increase the cost of tree. 

To reduce and possibly eliminate this extra overhead, one of the two incom- 
ing paths has to be pruned. Pruning a path, however, should be performed 
in a way such that no delay bounds to any of the multicast destinations are 
violated. In order to achieve this, the path with the larger delay should be
pruned. Thus, path S,2 will be pruned because it has a delay of 6 whereas
path S,1,2 has a delay of 2, as shown in Figure 5. The total cost of the
multicast tree after pruning is 15.
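
The pruning rule can be stated in a few lines. In this sketch each incoming
path is a `(nodes, delay, cost)` tuple; the function name and the tuple layout
are assumptions. In the usage below, the delays 6 and 2 and the pruned cost of
2 (the difference between the tree costs 17 and 15) come from the example,
while the cost of path S,1,2 is a placeholder value.

```python
def choose_path_to_prune(incoming):
    """Given the two paths entering one node, return (keep, drop).

    The path with the larger delay is dropped, so no destination below the
    node sees its delay increase.
    """
    keep, drop = sorted(incoming, key=lambda p: p[1])
    return keep, drop

keep, drop = choose_path_to_prune([(("S", 2), 6, 2), (("S", 1, 2), 2, 4)])
cost_after_pruning = 17 - drop[2]   # tree cost of 17 before pruning
```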

Lemma: if the current multicast tree does not have a node with more than 
one incoming path, then when a node is added, the new path will not create 
a node with more than two incoming paths. 

The above can be proved as follows. When a node is added, the shortest 
path from a tree node to the new node will not have a cycle because any 
shortest path cannot include cycles. Therefore, a path will not go through a 
node more than once. Consequently, the new path will not add more than 
one extra incoming path to any node in the multicast tree. Since no multicast 
node has more than one incoming path, the resulting multicast graph will 
never have a node with more than two incoming paths. 

Based on the above we propose a rearrangeable mode of SELDOM. The 
node join process of this mode tries to reduce the overall cost by pruning the 
extra paths while minimizing the number of rearrangements. For every node 
addition it finds all possible paths as is done in the previous mode.
Then, for each possible path that satisfies the delay bound, it computes the 
cost of the multicast tree by adding the cost of the new path minus the cost of 
any possible pruned paths that can exist if that node is added. More formally, 
the node-join algorithm is defined as follows: 

1. Find the least cost path SP(T,v) from T to v. If SP(T,v) total delay is 
bounded then return that path and quit. Otherwise, perform the following 
steps. 

2. Create a new graph G' which consists of G’s nodes and G’s edges reversed. 

3. Let P be the set of the shortest delay paths from v to T’s nodes in G'. 

4. Remove the edges of T from G'.

5. Find the set of the least cost paths from v to T’s nodes in G' and add them 
to P. 

6. For each possible path in P, compute the cost of the resulting multicast
tree including the new possible path. The cost of the tree is computed by
finding the current cost of the multicast tree including the new path minus
the cost of any possible pruned paths that can result from adding that path.

7. Out of P, pick the path p which satisfies the delay bound such that
$delay(p) < \Delta_v$ and the total cost(T) is minimum.

8. If no such path exists then return “cannot add this node”. 

9. Return p. 
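
Steps 6 and 7 amount to ranking the delay-bounded candidates by the cost of
the tree that would result. The sketch below assumes that the candidate set P
has already been computed and that each candidate carries its end-to-end
delay, the cost of its new links, and the cost of the path it would allow to
be pruned; the function name and the dictionary keys are hypothetical.

```python
def pick_join_path(tree_cost, candidates, bound):
    """Return (resulting tree cost, path) for the best candidate, or None."""
    best = None
    for c in candidates:
        if not c["delay"] < bound:       # step 7: delay bound must hold
            continue
        total = tree_cost + c["cost"] - c["pruned_cost"]   # step 6
        if best is None or total < best[0]:
            best = (total, c["path"])
    return best
```

With the numbers of the example above (tree cost 9, a new path that raises
the cost to 17, and a pruned path of cost 2), the resulting tree cost is
9 + 8 - 2 = 15.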

If all nodes that joined the multicast tree directly use the shortest cost 
paths without violating any delay bounds, then the produced tree will be 
similar to the tree produced by OGH (On-Line Greedy Heuristic). It was shown
by simulation that when there are only node additions, OGH produces trees
with an average cost that does not exceed the optimal tree cost by more than
10%. Imase and Waxman [9] proved that in the worst case the cost of the
multicast tree produced by OGH in an undirected graph is no worse than twice
the cost of the multicast tree produced by the best nonrearrangeable
algorithm.

(b) Leave Request 

In the nonrearrangeable mode, the deletion of a node in response to a leave
request causes SELDOM to mark the node as a non-multicast node. Furthermore,
if the node is a leaf node, all edges and nodes in the relay path linking
that node to the tree, including the leaving node itself, are removed from the
tree. In the rearrangeable SELDOM, however, node deletion can be improved
by performing limited rearrangement. The basic steps undertaken by SELDOM
to remove a node in response to a leave request are described below:

Assume the multicast tree receives a leave request from node v. Let deg(v)
be the degree of node v. Also, let a relay path be a path all of whose
internal nodes have degree two and are not multicast members. Then the node
deletion algorithm can be explained as follows:

If deg(v) > 2 then

    mark v as a non-multicast member.

else if deg(v) = 2 then

    Delete the relay path from v up to node $v_{left}$ and the relay
    path up to node $v_{right}$, where $v_{left}$ is the end node of the
    relay path on the left of v and $v_{right}$ is the end node of the relay
    path on the right of v. The above will divide the multicast tree
    into two subtrees.

    Assume that the source node is in the $v_{left}$ subtree. Also, assume
    $T_1$ is the $v_{right}$ subtree. Reconnect node $v_{left}$ and $T_1$
    using the least cost path from $v_{left}$ to $T_1$ which does not violate
    the delay bound. The least cost path is the lowest-cost, delay-bounded
    path among the following paths: the least-cost paths from $v_{left}$ to
    every $T_1$ node, the least delay paths from $v_{left}$ to every $T_1$
    node, and the old path between $v_{left}$ and $v_{right}$.

else

    Delete v and its relay path up to u, where u is the end node of the
    relay path. If u is not a multicast member then delete u along with
    its two relay paths up to $u_{left}$ and $u_{right}$. Assume that the
    source node is in the $u_{left}$ subtree and $T_1$ is the subtree of
    $u_{right}$. Reconnect $u_{left}$ and $T_1$ in a way similar to
    connecting $v_{left}$ and $T_1$ in the above.
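
The first half of the deg(v) = 2 case, locating and deleting the two relay
paths, can be sketched as follows. The tree is stored as a map from node to
neighbour set; this layout and the function names are assumptions of the
sketch, the source is assumed to be a multicast member, and the reconnection
of the two subtrees is deliberately left out.

```python
def relay_end(tree, members, prev, cur):
    """Follow the relay path that starts with link (prev, cur).

    Returns (end node, interior nodes), where the interior nodes are the
    non-member nodes of degree two crossed on the way.
    """
    interior = []
    while cur not in members and len(tree[cur]) == 2:
        interior.append(cur)
        (nxt,) = tree[cur] - {prev}      # the only way forward
        prev, cur = cur, nxt
    return cur, interior

def remove_degree_two_member(tree, members, v):
    """Delete v and both of its relay paths; return the two end nodes."""
    if len(tree[v]) != 2:
        raise ValueError("this sketch only covers deg(v) = 2")
    members.discard(v)
    ends, doomed = [], [v]
    for nbr in list(tree[v]):
        end, interior = relay_end(tree, members, v, nbr)
        ends.append(end)
        doomed.extend(interior)
    for n in doomed:                     # remove nodes and their links
        for nbr in tree.pop(n):
            if nbr in tree:
                tree[nbr].discard(n)
    return ends
```

The two returned end nodes play the roles of the left and right end nodes in
the description above; which of them lies on the source side still has to be
determined by the caller before reconnecting.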
The above deletion algorithm limits the number of rearrangements to one.
Because it requires finding the shortest path tree, the time complexity of the
algorithm is $O(n^2)$. The above enhanced deletion algorithm is expected to
reduce the cost of the multicast tree. However, the algorithm can be improved
even further. To illustrate this improvement, assume that $T_1$ is the subtree
that includes the end node of the left relay path and $T_2$ is the subtree
that includes the end node of the right relay path. Then, a better path is the
one that connects the $T_1$ subtree with the $T_2$ subtree instead of the best
path that connects one node with the other subtree. However, that makes the
algorithm more complex because it requires finding the least cost paths from
every $T_1$ node to every $T_2$ node. This process has a time complexity of
$O(n^3)$.

Because SELDOM uses the shortest delay paths, it finds a solution if one
exists. The time complexity of a node addition is similar to the time
complexity of the shortest path algorithm, which is $O(n^2)$ where n is the
number of nodes in the graph. That is because in the worst case it requires
computing the cost of the least-delay paths and the least-cost paths from the
multicast nodes to the node being added. By reversing the direction of the
links, this still can be done in $O(n^2)$. The complexity of link pruning will
never exceed the number of links. Similarly, the time complexity of node
deletion is bounded by $O(n^2)$. Hence, the overall time complexity of SELDOM
is $O(n^2)$ for each node addition or deletion.



4 CONCLUSION 

The multicast problem is NP-complete. Therefore, some heuristics were sug- 
gested to give a good approximation with polynomial time complexity. This 
paper started with a discussion of low cost, delay-bounded multicast trees. 
Next the online multicast problem was discussed. The online multicast prob- 
lem is difficult because members join and leave the multicast tree dynamically. 
One possible solution is to use any of the static multicast heuristics to solve 
the online multicast problem. However, that will be costly because it requires 
tearing down and reconnecting the current multicast tree connections. A few 
heuristics for the online problem were presented. Some of these heuristics did 
not require rearrangements while the others tried to limit the number of re- 
arrangements. The heuristics that allow rearrangement, however, usually give 
better results. None of these online heuristics bounds the delay.

A new heuristic, SELDOM, for online, low-cost multicasting for real-time 
applications was presented. The nonrearrangeable mode of SELDOM adds a 
node using the shortest path if that path does not violate the delay bound. 
If the path violates the delay bound, a search for a low-cost, delay-bounded 
path is performed. A node is deleted by removing the node and the relay path 
from the tree to that node. A rearrangeable mode of SELDOM improves the 
joining process by pruning any possible two incoming paths. Also, it enhances 
the node leaving process by making limited rearrangement to the graph if that 
node has a degree of two. The time complexity of both the nonrearrangeable
and the rearrangeable SELDOM is $O(n^2)$.
Abstract

We present a mechanism for providing state feedback information to multicast 
sources of multimedia streams in a scalable and robust manner. The presented 
feedback mechanism is suitable for best-effort unreliable networks such as the 
Internet. This mechanism is useful for controlling the transmission rate of mul- 
timedia sources in both cases of layered and single-rate multicast. It allows 
for determining the worst case state among a group of receivers, where each 
receiver may be in one of a set of finite states, and is applicable in receiver- 
driven as well as in sender-driven adaptive multimedia systems. Simulation 
results show that the presented feedback mechanism scales well for very large 
groups of up to a few thousand participants. The efficiency of the proposed
mechanism in eliminating the reply implosion problem, its robustness in fac- 
ing network losses, as well as its responsiveness are illustrated. In addition, 
the advantages of the proposed mechanism over other feedback mechanisms 
are demonstrated. Moreover, adaptive enhancements for the mechanism are 
proposed to maintain its scalability for even larger groups. 



Keywords 

Feedback, multicast, adaptive multimedia applications. 



1 INTRODUCTION 

Multimedia streams are becoming a main component of modern distributed 
collaboration and tele-teaching systems. Most of these systems rely on IP 
multicasting in order to scale to large groups of participants. However, the 
quality of service (QoS) requirements of the multimedia streams demand spe- 
cial treatment. The main approaches taken for handling the requirements of 
multimedia streams can be broadly classified as either proactive or reactive.
The proactive approach relies on the existence of a resource reservation proto- 
col (Gupta et al 1995, Zhang et al 1993), and underlying scheduling mecha- 
nisms, to reserve and guarantee end-to-end resources. On the other hand, the 
reactive approach relies mainly on the ability of the application to adapt itself 
to the level of available resources (Bolla et al 1997, Bolot et al 1994, Cheung 
et al 1996, McCanne et al 1996). Most of these approaches for handling mul- 
timedia streams manage individual connections in isolation from others, which 
may lead to a state of competition for resources among streams belonging to 
the same session, thus decreasing the overall perceived session quality. Our 
approach, however, is to dynamically control the QoS offered by the sys- 
tem across the set of connections belonging to the application. This control 
is based on the application semantics, and focuses on maintaining the best 
overall quality of session, at every instant during the session lifetime. To this 
end, we introduced the concept of Quality of Session (QoSess) (Youssef et 
al 1997). 

In (Youssef et al 1998), we propose an architecture for a middleware 
platform, which supports collaborative multimedia applications by provid- 
ing QoSess control mechanisms. Conceptually the QoSess control layer acts 
as a closed loop feedback system that constantly monitors the observed be- 
havior of the streams, takes inter-stream adaptation decisions, and sets the 
new operating level for each stream from within its range of permissible op- 
erating points. Over wide area network connections, the QoSess control layer 
manages the resources that are collectively reserved, for the streams of a 
distributed application, by a resource reservation protocol, such as RSVP. 
Multi-grade streams are central to the QoSess framework, in order to support 
heterogeneity of receivers and network connections. Multi-grade transmission 
can be achieved either by hierarchical encoding (McCanne et al 1997, Sen- 
bel et al 1997), or by simulcast which is the parallel transmission of several 
streams each carrying the same information encoded at a different grade (Li 
et al 1996, Willebeek-LeMair et al 1997). 

In this paper, we present one of the main building blocks of the QoSess con- 
trol layer: a scalable and robust state feedback mechanism. This mechanism 
provides the source of a multimedia stream with deterministic information re- 
garding the state of the receivers. The state of a receiver may be defined as the 
layers which it is interested in receiving from the source of a hierarchically en- 
coded stream. Given this knowledge, the sender can suppress or start sending 
the correct layers. The feedback mechanism is not only important for saving 
the sender’s host and LAN resources but for saving WAN resources as well 
in situations where the application’s addressing scheme for the layers does 
not permit the intermediate routers to suppress unwanted layers, or where 
the session is conducted over an Intranet whose subnets are inter-connected 
via low level switches that do not implement the IGMP protocol (Deering 
et al 1990) for suppressing unwanted multicast packets. Soliciting feedback 
from receivers in a multicast group might create a reply implosion problem, 
in which a potentially large number of receivers send almost simultaneous 
redundant replies. We present a scalable and robust solution to this problem. 

The rest of this paper is organized as follows. In Section 2, the role of 
feedback in several adaptive multimedia multicast systems is illustrated. A 
brief survey of the different approaches for providing scalable feedback is pre- 
sented in Section 3. The proposed feedback mechanism is described in detail 
in Section 4, followed by a performance study and comparison in Section 5. 
In Section 6, adaptive enhancements for the proposed mechanism in order 
to support very large groups of receivers are described, and we present our 
conclusions in Section 7. 



2 FEEDBACK ROLE IN MULTIMEDIA MULTICAST 

Early attempts towards providing adaptive transport of multimedia streams 
over the Internet focused on the sender as the entity playing the major role 
in the adaptation process (Bolot et al. 1994, Busse et al. 1995). Information 
about the congestion state of the network, as seen by the receivers, was fed 
back to the sender, which used it to adapt to changes in the network state. 
In many cases, the monitored performance parameters (e.g., loss rate, delay, 
jitter, throughput) were mapped, by the receiver, to one of several qualitative 
performance levels, and reported to the sender (Bolot et al 1994, Busse et 
al 1995, Cheung et al 1996). The sender adapted its transmission rate by 
varying the quality of the transmitted media content by means of controlling 
several encoder parameters (e.g., frame rate, frame size, or quantization step 
for video streams). The sender often based its decisions on the worst case 
state reported (Busse et al 1995), and sometimes based it on a threshold 
of the number of receivers suffering the worst state (Bolot et al 1994). In 
this approach all receivers have to receive the same quality of multimedia 
streams regardless of the differences in their capacities and the capacities of 
the network connections leading to them. Although it is sometimes desirable to 
maintain identical stream quality across all participants of a session (e.g., for 
some discrete media streams), this is not always the case, especially with 
continuous media streams. 

The first approach, to address the need for providing a multi-grade ser- 
vice to participants of the same session, was represented by the introduction 
of the concept of simulcast (Li et al 1996, Willebeek-LeMair et al 1997). 
In a simulcast system, the sender simultaneously multicasts several parallel 
streams corresponding to the same source, but each is encoded at a different 
quality level. Each receiver joins the multicast group that matches its capa- 
bilities. Within a group, the same techniques of source adaptation, that were 
mentioned above, are applied within a limited range. Thus, the same feedback 
mechanisms are also deployed within each group. 

With the advent of hierarchical encoding techniques (McCanne et al 1997, 
Senbel et al 1997), a new trend in adaptive multimedia transport appeared in 
which the receiver plays the sole role in adaptation (McCanne et al 1996). In 
such systems the receiver is responsible for determining its own capabilities, 
and consequently, it selects the number of layers to receive from the hierar- 
chically encoded stream. The source, however, is assumed to be constantly 
multicasting all the layers. 

While the layered encoding approach clearly utilizes resources more efficiently 
than the simulcast approach, it is still 
debatable whether layered encoding techniques will be able to provide the 
same media quality as the simulcast encoders which operate in parallel, each 
optimized for a particular target rate. In spite of this debate, the layered 
approach is the most appealing from the networking point of view, due to its 
efficient utilization of network resources, especially bandwidth. However, this 
approach as described is not as efficient as it could be. Because the source 
constantly sends all layers at full rate, it may waste 
more resources than simulcast when no receiver subscribes 
to some of the layers. On the other hand, augmenting this approach with a 
simple scalable feedback mechanism that provides the source with information 
regarding which layers are being consumed and which are not, yields more 
efficiency in resource consumption, as the sender can get actively involved in 
the adaptation process by suppressing the unused layers. 

The introduction of such a feedback mechanism, for receiver-oriented lay- 
ered transport of multimedia streams, is not only an added efficiency feature 
for such transport protocols, but it is also a critical feature for the success 
of collaborative multimedia sessions in which multiple streams are concur- 
rently active. In such collaboration sessions, multiple streams are typically 
distributed to all participants of the session, and the overall session quality is 
determined by the quality of each of the streams as well as by their relative 
importance and contribution to the on-going activity. In the presence of scarce 
resources, it is logical to sacrifice the quality of one low priority stream for the 
sake of releasing resources to be used by a higher priority stream. Should the 
low priority stream source keep pushing all unused layers to the network, the 
decision taken by the receivers to drop these layers for releasing resources is 
rendered almost useless. The sender’s host and LAN never recover these 
resources, while the rest of the network may eventually have them 
released as the multicast routers stop forwarding the unused layers. 
In situations where the application’s addressing scheme for the layers does not 
permit the intermediate routers to suppress unwanted layers, WAN resources 
may also be wasted. 

In the former case, besides the unnecessary delay in releasing resources, 
the fact that the sender’s host and LAN will always be overloaded is very 
critical, as the session participants on this LAN may not be able to receive 
other higher priority streams. The problem is more crucial for Intranet based 
collaboration systems since all the session participants (senders and receivers) 
are typically within a few hops from one another (Maly et al 1997). 

Moreover, since the sender may be sending only a subset of its layers, it 
needs to know about the existence of clients for higher layers that are currently 
suppressed, as soon as these clients subscribe to these layers. This information 
must be provided to the sender in a timely and scalable way that avoids poten- 
tial implosion problems in such cases when many clients subscribe to higher 
layers almost simultaneously. This is likely to happen when some streams are 
shut down, releasing resources that can be utilized by other active streams. 

From the above we conclude that a feedback mechanism is necessary for 
involving the sender in the adaptation process for receiver driven layered 
multicast of multimedia streams, especially in the context of collaborative 
multimedia sessions. Moreover, such a feedback mechanism is essentially the 
same as, and can replace, feedback mechanisms for supporting simulcast and 
single-rate multicasts. In the following section, we briefly describe the different 
approaches to providing scalable feedback, then in Section 4, we introduce the 
proposed scalable and robust mechanism for providing feedback in adaptive 
multimedia multicast systems. 



3 EXISTING SCALABLE FEEDBACK TECHNIQUES 

Soliciting information from receivers in a multicast group might create a re- 
ply implosion problem, in which a potentially large number of receivers send 
almost simultaneous feedback messages that contain redundant information. 
Typical solutions to address this problem include probabilistic reply, expand- 
ing scope search, statistical probing, and randomly delayed replies (Bolot et 
al 1994). 

Probabilistic reply: In a probabilistic reply scheme, a receiver responds to 
a probe from the source with a certain probability. If the source does not 
receive a reply within a certain timeout period, it sends another probe. This 
scheme is easy to implement. However, the source is not guaranteed to receive 
the worst news from the group within a certain limited period. In addition, 
the relationship between the reply probability and the group size is not well 
defined. 

Expanding scope search: In the expanding scope search scheme, the time- 
to-live (TTL) of the probe packets sent by the source is gradually increased. 
This scheme aims at pacing the replies according to the source capacity of 
handling them, since the source does not re-send the probe with increased 
scope until it has processed all previous replies. Clearly this is efficient only in 
the case where the receivers are uniformly distributed in TTL bands, which 
may not be the case. 

Statistical probing: This scheme relies on probabilistic arguments for scalability. 
At the start of a round of probes (called epoch), the sender and each 
of the receivers generate a random key of a fixed bit length. In each probe, the 
source sends out its key together with a number specifying how many of the 
key digits are significant. Initially, all digits are significant. If a match occurs 
at a receiver then that receiver is allowed to send a response. If no response is 
received within a timeout period, the number of significant digits is decreased 
by one and another probe is sent. In (Bolot et al. 1994), it was shown that 
there is a statistical relationship between the group size and the average round 
upon which a receiver first matches the key. This scheme is efficient in terms 
of number of replies needed to estimate the group size. However, as shown 
in (Bolot et al 1994), the maximum response time (the time needed for the 
source to identify the worst case of all receivers) is equal to 32 times the worst 
case round trip time. For a typical worst case RTT of 500 milliseconds, it may 
take up to 16 seconds to find the worst case state of all receivers. 
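
To make the round structure of statistical probing concrete, the following 
Python sketch compares keys on their leading bits and counts how many rounds 
pass before some receiver matches. It is only an illustration: the function 
names, the choice of leading (rather than trailing) bits as the significant 
ones, and the small key length used in the usage note are assumptions, not 
details taken from (Bolot et al 1994). The last helper merely reproduces the 
arithmetic above. 

```python
from typing import Sequence


def key_matches(sender_key: int, receiver_key: int,
                significant: int, key_bits: int) -> bool:
    """Return True if the `significant` leading bits of both keys agree.

    Assumption: the leading bits are the significant ones.
    """
    shift = key_bits - significant
    return (sender_key >> shift) == (receiver_key >> shift)


def first_matching_round(sender_key: int, receiver_keys: Sequence[int],
                         key_bits: int) -> int:
    """Return the first round (numbered from 0) in which a receiver matches.

    Round n uses key_bits - n significant bits, so every key matches in
    round key_bits at the latest.
    """
    for round_no in range(key_bits + 1):
        significant = key_bits - round_no
        if any(key_matches(sender_key, key, significant, key_bits)
               for key in receiver_keys):
            return round_no
    raise ValueError("receiver_keys must not be empty")


def worst_case_response_time(rounds: int, worst_rtt_s: float) -> float:
    """Upper bound on response time when each round costs one worst case RTT."""
    return rounds * worst_rtt_s
```

With hypothetical 4-bit keys, a sender key of 0b1010 and a single receiver 
key of 0b1011 first match in round 1, once the last bit is no longer 
significant. 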

Randomly delayed replies: In the randomly delayed replies scheme, each 
receiver delays the time at which it sends its response back to the source by 
some random amount of time. Clearly, the success of this scheme in prevent- 
ing the reply implosion problem depends to a great extent on the duration 
of the period from which random delays are chosen. However, the scheme is 
very appealing, in the sense that it allows for receiving responses from all the 
receivers in the group, if the delay can be adapted using some knowledge of 
the size of the group. 

From the above basic mechanisms, the randomly delayed replies approach, 
augmented with suppression of redundant replies and careful selection of de- 
lay periods, is the most appealing for two main reasons: first, a response is 
always guaranteed; and second, the response time is expected to be always 
low. This is the basic idea deployed in IGMP (Internet Group Management 
Protocol) (Deering et al 1990). In IGMP, the probe is sent to a local area 
network (LAN), and hence as soon as one of the receivers responds to the 
probe it is guaranteed that all the other receivers will hear that response and 
suppress their replies. Also, in such a local environment, the timeout period 
can be set to a fixed small value. In contrast, in our case, the group of receivers 
may be distributed over a wide area network (WAN), thus a reply sent by one 
receiver may not be heard by another before the other one emits its own reply 
which may be redundant. This implies the need for careful selection of the 
delay randomizing functions. 

A closely related, but different, problem is the negative acknowledgment 
(NAK) implosion problem associated with reliable multicasting. A solution 
for the NAK implosion problem, which is based on randomly delayed replies 
with suppression of redundant NAKs, is adopted by the SRM protocol (Floyd 
et al 1995). In SRM, when a receiver detects a lost packet, it randomizes 
the delay before sending its NAK in the interval $[C_1 d_i, (C_1 + C_2) d_i]$, where 
$d_i$ is the distance from receiver i to the source, and $C_1$ and $C_2$ are constant 
parameters. Both the NAK and the state feedback implosion problems are 
similar in the need for soliciting replies from a potentially very large group 
of receivers. However, with NAKs, whenever a data packet is lost on a link, 
all the receivers that the faulty link leads to will eventually detect the loss 
and send a NAK. Thus the distance between a receiver and the faulty link 
is the major factor that determines when the receiver will detect the fault, 
and consequently favoring closer receivers, by letting them send their NAKs 
earlier, implies suppression of more redundant NAKs. On the other hand, in 
the state feedback problem, the capacity of the receiver, and consequently its 
state, may not be related to its distance from the source. Therefore, a different 
criterion for randomizing the delays is required. 

Figure 1 Overhead of session messages 

In SRM, each receiver must determine its distance from the source to use 
it in the delay function. The overhead of session messages (typically RTCP 
reports (Schulzrinne et al 1996)) which are needed for that is not negligible. 
Figure 1 shows the overhead of RTCP reports for different session sizes and 
rates, assuming a single source. One of the objectives of the proposed feedback 
mechanism is to eliminate this high overhead, by designing the mechanism in 
a way that is not dependent on periodic session messages. 



4 A SCALABLE FEEDBACK MECHANISM 

In this section, we describe the proposed mechanism for eliciting feedback 
information from the receivers in a multicast group. The objective of the al- 
gorithm is to find out the worst case state among a group of receivers. The 
definition of the worst case state is dependent upon the context in which 
the feedback mechanism is applied. It can be the network congestion state 
as seen by the receivers. This may be useful for applications where a similar 
consistent view is required for all the receivers, and the source is not capable 
of providing a multi-grade service, and hence must adapt to the receiver ex- 
periencing the worst performance. Another definition, of worst case state as 
seen by all receivers, is identifying the highest layer a receiver is expecting to 
receive in a hierarchically encoded stream. This allows the sender to adjust 
its transmission rate in order not to waste resources on layers that no receiver 
is subscribing to, and to start sending previously suppressed layers as soon 
as receivers subscribe to receive them. This is particularly important in the 
context of managing multimedia streams in collaborative sessions, because 
in such sessions the sender of a stream is typically simultaneously receiving 
multiple streams, and hence the assumption that the sender has abundant 
resources is not valid. 

In the rest of the paper, we assume that at every instant in time each 
receiver is in one state s, where s = 1, 2, ..., H. H is the highest or worst case 
state, and the state of a receiver may change over time. 

We consider the general case when neither the group size nor the round-trip 
time from the sender to each receiver is known. As will be shown later, this 
information is not necessary as the mechanism estimates the average round 
trip time in the group, and uses it to adjust its timeout periods. 

In the proposed mechanism, the sender sends one type of probe message, 
called a SolicitReply message, on a special multicast group which the sender 
and all the receivers join. The probe message contains an RTT field, which 
contains an estimate for the average round trip time from the sender to the 
group members. Upon receiving the SolicitReply probe, a receiver sets a timer 
to expire after a random delay period which is drawn from the interval 



$$\left[\, C_1 f(s)\,\frac{RTT}{2},\;\; \bigl(C_1 f(s) + C_2 g(s)\bigr)\,\frac{RTT}{2} \,\right]$$ 

where f(s) and g(s) are two non-increasing functions of the state s, and $C_1$ and 
$C_2$ are two parameters whose values are discussed later in detail. The receiver 
then keeps listening to the multicast group. If the timer expires, the receiver 
multicasts a reply message to the whole group. The reply message contains 
the state information as seen by this receiver (e.g., highest layer expected 
to receive in a hierarchically encoded stream). On the other hand, if the 
receiver receives another receiver’s reply before its timer expires and that reply 
contains either the same or higher (worse) state, then the receiver cancels its 
timer and suppresses its own reply. This implies the need for careful selection 
of f(s), g(s), $C_1$, and $C_2$ in order to avoid the reply implosion problem, while 
maintaining a low response time. In the subsequent subsections, we discuss in 
detail choices for f(s), g(s), $C_1$, $C_2$, and RTT. 

Figure 2 Distribution of timeout periods according to receiver state 
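
The receiver side of the mechanism can be sketched in a few lines of Python. 
This is a minimal illustration rather than the authors' implementation: the 
function names are hypothetical, f and g are passed in as arbitrary callables, 
and the delay interval is assumed to be [C1 f(s) RTT/2, (C1 f(s) + C2 g(s)) 
RTT/2], which is consistent with the timer analysis in Section 4.2. 

```python
import random
from typing import Callable


def draw_delay(state: int, rtt: float, c1: float, c2: float,
               f: Callable[[int], float], g: Callable[[int], float],
               rng: random.Random) -> float:
    """Draw the random delay a receiver waits after a SolicitReply probe.

    Assumed interval: [c1*f(s)*rtt/2, (c1*f(s) + c2*g(s))*rtt/2].
    """
    half_rtt = rtt / 2.0
    low = c1 * f(state) * half_rtt
    high = (c1 * f(state) + c2 * g(state)) * half_rtt
    return rng.uniform(low, high)


def should_suppress(own_state: int, heard_state: int) -> bool:
    """A pending reply is cancelled when a reply carrying the same or a
    higher (worse) state is heard before the timer expires."""
    return heard_state >= own_state
```

A receiver would call draw_delay when the probe arrives, then apply 
should_suppress to every reply it hears on the multicast group until its 
timer fires. 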



4.1 Selecting the timeout functions 

The objective of setting the timeout periods as a function of f(s) and g(s) is 
to distribute the timeouts as in Figure 2. Receivers in higher states randomize 
their timeouts over periods that start earlier than receivers in lower states, 
thus allowing for higher state responses to suppress lower state responses. 
In addition, the lower state receivers randomize their timeouts over longer 
periods relative to higher state receivers. As time elapses without any 
response being generated, it becomes more likely that the distribution of receivers over 
states is biased toward the lower states. It is therefore desirable 
to randomize these condensed replies over longer periods. 

In order to meet these objectives, f(s) and g(s) must be non-increasing 
functions of s. Also, f(H) should equal 0 to avoid unnecessary delays in 
response time, while g(s) > 0 must be satisfied for all values of s to allow 
for randomization of timeout periods. We chose to make f(s) and g(s) linear 
functions in s in order to avoid excessive delays in response time, where f(s) = 
H - s, and g(s) = f(s) + k = H - s + k. 

The parameters $C_1$ and $C_2$ scale the functions f(s) and g(s). $C_1$ controls 
the aggressiveness of the algorithm in eliminating replies from lower state 
receivers, while $C_2$ controls the level of suppression of redundant replies from 
receivers in the same state. The values of these two parameters are explored 
in depth in the following sections. The value of k is set to 1. Selecting the 
value of k is not critical, since the parameter $C_2$ scales g(s), and the value 
of $C_2$ can be tuned to optimize the performance of the mechanism given the 
selected value of k. 
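
As a worked illustration of the linear choice f(s) = H - s and g(s) = H - s + k, 
the Python sketch below computes the start and end of the timeout period for 
each state. The function name is hypothetical, the interval is assumed to be 
scaled by RTT/2, and the parameter values in the usage note (C1 = 2, C2 = 4, 
RTT = 1) are picked only for the example. 

```python
from typing import Tuple


def f(s: int, H: int) -> int:
    """Linear shift function: zero for the worst case state H."""
    return H - s


def g(s: int, H: int, k: int = 1) -> int:
    """Linear spread function: strictly positive for every state."""
    return H - s + k


def timeout_interval(s: int, H: int, rtt: float,
                     c1: float, c2: float, k: int = 1) -> Tuple[float, float]:
    """Return (start, end) of the period from which state-s timers are drawn.

    Assumed form: [c1*f(s)*rtt/2, (c1*f(s) + c2*g(s))*rtt/2].
    """
    half_rtt = rtt / 2.0
    start = c1 * f(s, H) * half_rtt
    end = (c1 * f(s, H) + c2 * g(s, H, k)) * half_rtt
    return start, end
```

With H = 5, RTT = 1, C1 = 2 and C2 = 4, state 5 draws from (0.0, 2.0), state 4 
from (1.0, 5.0), and state 1 from (4.0, 14.0): higher states start earlier, 
and lower states spread their timers over longer periods. 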



4.2 Exploring the parameter space 

In this section, we attempt to find bounds for the ranges of operation of the 
parameters $C_1$ and $C_2$. Obviously, low values for $C_1$ and $C_2$ are desired in 
order to reduce the response time. On the other hand, excessive reduction in 
the value of either of the two parameters may lead to inefficiency in terms of 
the number of produced replies possibly leading to a state of reply implosion. 

In order to effect a shift in the start time of the timeout periods based 
on the state of the receiver, as in Figure 2, $C_1 > 0$ must be satisfied for all 
s < H. This shift allows for the high state replies to suppress low state replies. 
Similarly, $C_2 > 0$ must be satisfied for all values of s, in order to allow for 
randomization of timeout periods for receivers belonging to the same state, 
thus enabling suppression of redundant replies which carry the same state 
information. 

To further bound the values of $C_1$ and $C_2$, we analyze two extreme network 
topologies, namely: the chain and the star topologies. Given a certain distri- 
bution of receiver distances from the sender, the feedback mechanism exhibits 
worst case performance when the receivers are connected in a star topology 
with the sender at its center. This is because connecting those receivers in 
a star topology maximizes the distance between any pair of receivers, making it the 
sum of their distances from the sender, and hence minimizes the likelihood of 
suppression of redundant replies. In contrast, connecting those receivers 
in a chain topology minimizes the distance between any pair, making it the difference 
between their distances from the sender, and hence maximizes the likelihood 
of suppression of redundant replies. Therefore, for a given distribution of dis- 
tances, and an arbitrary topology, the performance of the feedback mechanism 
lies somewhere in between the chain and the star cases. 
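
The two extremes differ only in how far apart two receivers are for a given 
pair of distances from the sender. A small Python sketch of that relation 
follows; the function name and the string labels for the topologies are 
illustrative assumptions. 

```python
def receiver_distance(d_i: float, d_j: float, topology: str) -> float:
    """One-way distance between two receivers located d_i and d_j
    away from the sender, for the two extreme topologies."""
    if topology == "star":
        # Every message passes through the sender at the center.
        return d_i + d_j
    if topology == "chain":
        # Both receivers lie on the same path leaving the sender.
        return abs(d_i - d_j)
    raise ValueError(f"unknown topology: {topology!r}")
```

For any arbitrary topology with the same distances from the sender, the 
distance between the two receivers lies between these two values, which is 
why the chain and the star bound the performance of the mechanism. 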

(a) Chain topology 

In the chain topology, the sender is at one end of a linear list of nodes. The 
rest of the nodes in the list are receivers. Let r = RTT/2 be a bound on the one 
way distance from the sender to any of the receivers or vice versa. Let the 
sender send a probe at time t. The farthest receiver receives the probe at time 
t + r. If this receiver is the only one in the highest state, and if it emits its 
reply as soon as it receives the probe, then all other receivers will have heard 
this reply by time t + 2r. In order to suppress all replies from lower state 
receivers in this case, $C_1 > 2$ must be satisfied. $C_1 = 2$ makes the difference 
between the start time of two successive states equal to 2r. 



(b) Star topology 

In the star topology, the sender is connected to each receiver by a separate link. 
Any message sent from one receiver to another passes through the sender’s 
node. Let all the receivers be at a distance r = RTT/2 from the sender. Thus 
the distance between any two receivers is equal to 2r. 

Let $G_s$ be the number of receivers in state s, and let $T_s$ be the first timer to 
expire for receivers in state s. The expected value of $T_s$ is $(C_1 f(s) + \frac{C_2 g(s)}{G_s}) r$, 
since $G_s$ timers are uniformly distributed over a period of $C_2 g(s) r$. 

For receivers having the same state, if the first timer expires at time t, then 
all the timers that are set to expire in the period from t to t + 2r will not be 
suppressed, and all those that are set to expire after t + 2r will be suppressed. 
Therefore, the expected number of timers to expire is equal to 1 plus the 
expected number of timers to expire in a period of length 2r, which is equal 
to $1 + \frac{2 G_s}{C_2 g(s)}$. Looking at the case of s = H, since g(H) = 1, then setting $C_2$ to 
any value less than 2 does not allow for suppression of any of the redundant 
replies from receivers in state H. Thus $C_2 > 2$ must be satisfied. 
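
The suppression argument for receivers in the same state can be checked 
numerically. The Python sketch below is a simplified model under stated 
assumptions: all receivers sit at distance r from the sender in a star, their 
timers start at the same instant, and the function names are hypothetical. 
It counts the timers that fire before the first reply, which needs 2r to 
reach the other receivers, can suppress them. 

```python
import random
from typing import Sequence


def unsuppressed_replies(expiry_times: Sequence[float], r: float) -> int:
    """Count timers that expire no later than 2r after the first one.

    In the star topology a reply reaches the other receivers after 2r,
    so only timers set to expire later than that are suppressed.
    """
    if not expiry_times:
        return 0
    first = min(expiry_times)
    return sum(1 for t in expiry_times if t <= first + 2 * r)


def simulate_same_state(group_size: int, c2: float, g_s: float,
                        r: float, rng: random.Random) -> int:
    """Draw group_size timers uniformly over a period of c2*g_s*r and
    return how many replies are actually sent."""
    period = c2 * g_s * r
    timers = [rng.uniform(0.0, period) for _ in range(group_size)]
    return unsuppressed_replies(timers, r)
```

When c2 * g_s is below 2 the whole period is shorter than 2r, so every 
receiver replies; this is the case the text rules out for state H. 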

In order to suppress all replies from receivers in state s - 1, we must have 

$$\begin{aligned}
T_s + 2r &< T_{s-1} \\
\left(C_1 f(s) + \frac{C_2 g(s)}{G_s}\right) r + 2r &< \left(C_1 f(s-1) + \frac{C_2 g(s-1)}{G_{s-1}}\right) r, \\
\frac{g(s)}{G_s} - \frac{g(s-1)}{G_{s-1}} &< \frac{C_1 - 2}{C_2}
\end{aligned}$$ 

For values of $G_s$ and $G_{s-1}$ which are relatively larger than g(s) and g(s - 1), 
we get $C_1 > 2$, which is the same condition for $C_1$ which we obtained from the 
chain topology. In Section 5, we explore the effect of $C_2$ on the performance 
of the feedback mechanism using simulation experiments. 



4.3 Estimating the round trip time 

To compute the average round-trip time from the sender to the group of 
receivers, every probe sent is time-stamped by the sender. That time-stamp is 
reflected in the reply message together with the actual delay period that the 
receiver waited before replying. This allows the sender to compute the round- 
trip time to this receiver. The smoothed average round-trip time, srtt, and 
the smoothed mean sample deviation, rttvar, are computed from the received 
round-trip time samples, using the same technique applied in TCP (Jacobson 
1988), as follows: 

$$srtt = \alpha \cdot srtt + (1 - \alpha) \cdot sample, \qquad \alpha = 7/8,$$ 

$$rttvar = \beta \cdot rttvar + (1 - \beta) \cdot |srtt - sample|, \qquad \beta = 3/4.$$ 

In TCP, the amount srtt + 4 rttvar is used in setting the retransmission 
timeouts in place of twice the round-trip time. As will be shown in Section 5, 
this amount is conservative and overestimates the average round-trip time 
to the group members. Instead we use only srtt as the estimate for average 
round-trip time. The recent value of srtt is carried in the RTT field of the 
next probe. 
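
A minimal Python sketch of this estimator is given below. The class and 
function names are hypothetical, and two details are assumptions that the 
text does not settle: the first sample initializes srtt directly, and the 
deviation is measured against the value of srtt before it is updated. 

```python
from typing import Optional


def rtt_sample(arrival_time: float, probe_timestamp: float,
               receiver_delay: float) -> float:
    """Round-trip time sample for one reply: elapsed time since the probe
    was stamped, minus the delay the receiver reports having waited."""
    return arrival_time - probe_timestamp - receiver_delay


class RttEstimator:
    """Smoothed round-trip time and mean deviation, with alpha = 7/8
    and beta = 3/4 as in the text."""

    def __init__(self, alpha: float = 7 / 8, beta: float = 3 / 4) -> None:
        self.alpha = alpha
        self.beta = beta
        self.srtt: Optional[float] = None
        self.rttvar = 0.0

    def update(self, sample: float) -> float:
        """Fold one sample into the estimate and return the new srtt."""
        if self.srtt is None:
            # Assumption: seed the average with the first sample.
            self.srtt = sample
            return self.srtt
        # Assumption: deviation is taken against the previous srtt.
        deviation = abs(self.srtt - sample)
        self.rttvar = self.beta * self.rttvar + (1 - self.beta) * deviation
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * sample
        return self.srtt
```

The sender would call update once per received reply and copy the resulting 
srtt into the RTT field of its next probe. 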



5 SIMULATION STUDY AND PERFORMANCE COMPARISON 

In this section, we examine various issues, related to the performance and 
tuning of the feedback mechanism, using simulation. First we show the ability 
of the new feedback mechanism to eliminate the reply implosion problem as 
we explore the effect of C 2 on its performance. Then we examine the accuracy 
of the round trip time estimation algorithm. Finally, we further illustrate the 
scalability and robustness of the proposed feedback mechanism by contrasting 
it to an alternative candidate mechanism for feedback. 

In order to address these issues, we ran several simulation experiments. Each 
experiment was set up as follows. The group size, G, and the maximum round 
trip time, $RTT_{max}$, were selected. Round trip times uniformly distributed 
in the interval $[0, RTT_{max}]$ were assigned to all the receivers, except the 
worst case state receivers, whose round trip times were uniformly distributed 
in the interval $[t \cdot RTT_{max}, RTT_{max}]$, for investigating the effect of t on the 
performance, where 0 < t < 1. The number of states, H, was set to 5, and 
each receiver was randomly assigned one of these states. The choice of 5 states 
(or layers) is reasonable as the state of the art hierarchical video encoders 
typically provide a number of layers in this range (McCanne et al. 1996, Senbel 
et al 1997). Also, in applications where feedback information represents the 
perceived quality of service, typically 3 to 5 grades of quality are used (Bolot 
et al 1994, Busse et al. 1995). The feedback mechanism was simulated under 
the two extreme network topologies: the chain and the star. 
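
The experimental setup described above can be sketched as follows. This is 
not the authors' simulator: the names are hypothetical, and drawing each 
receiver's state uniformly from 1 to H is an assumption about what "randomly 
assigned" means. 

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Receiver:
    state: int   # 1 .. num_states, where num_states is the worst case
    rtt: float   # round trip time to the sender


def make_group(group_size: int, rtt_max: float, t: float,
               num_states: int, rng: random.Random) -> List[Receiver]:
    """Build one simulated group.

    Worst case state receivers get round trip times in
    [t*rtt_max, rtt_max]; all others get [0, rtt_max].
    """
    receivers = []
    for _ in range(group_size):
        state = rng.randint(1, num_states)  # assumption: uniform over states
        low = t * rtt_max if state == num_states else 0.0
        receivers.append(Receiver(state, rng.uniform(low, rtt_max)))
    return receivers
```

Passing an explicitly seeded random.Random instance keeps each experiment 
repeatable. 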



5.1 Bounding $C_2$ 

From the analysis in Section 4.2, we obtained the two conditions $C_1 > 2$ 
and $C_2 > 2$. Setting $C_1$ to its minimum value 2 eliminates replies from lower 
states, while avoiding unnecessary delays in response time. However, selecting 
an appropriate value for $C_2$ is not as straightforward. 

In Figure 3, the average number of replies is plotted for different values of 
$C_2$. The value of $C_1$ was set to 2, for all the experiments in this section, and 
the average round-trip time was used in the RTT field of the probe messages. 
It is clear from the figure that the performance of the feedback mechanism is 
not sensitive to the value of $C_2$ in the case of the chain topology. Also, the 
figure shows that the reply implosion problem is totally eliminated. Moreover, 
over 95% of the redundant replies were correct replies (i.e., worst case state 
replies) which shows the robustness of the mechanism in facing network losses 
and its efficiency in eliminating non-worst case replies. This also means that, 
practically, the sender may safely react according to the first received reply. 
Figure 4 depicts the corresponding average response times. The response time 
is measured at the sender, and represents the time from sending a probe until 
receiving the first correct reply. The response time behavior is the same for 
both topologies because it is dependent on the round-trip times distribution 
rather than on the topology. As shown in the figure, it is bounded from above 
by the maximum round-trip time to the group members. 

Figure 3 (axis label: Number of replies) 

Figure 4 The effect of $C_2$ on response time 

These figures suggest that $C_2 = 4$ is a reasonable setting. $C_2 > 4$ does not 
significantly reduce the number of replies, while the response time increases. 
As can be seen from the figures, for typical sessions with up to 100 participants 
(e.g., IRI sessions (Maly et al 1997)), less than 10% of the receivers reply to a 
probe, in the worst case, while for larger sessions of thousands of participants 
the reply ratio is below 1.5%. 



5.2 Evaluating the round trip time estimation technique 

As mentioned in Section 4.3, the amount srtt + 4 rttvar is used in setting 
the retransmission timeouts in place of twice the round trip time, in TCP. 
Figures 5(a) and (b) compare this approach to using only srtt as the estimate 
for average round trip time. We chose to avoid the conservative approach of 
TCP, and to use only srtt , to avoid unnecessary prolonging of delay periods 
thus avoiding excessive delays in response time. 



5.3 Performance comparison 

Here, we further illustrate the scalability and robustness of the proposed 
feedback mechanism by contrasting it with an alternative candidate feedback 
mechanism. The alternative mechanism uses the same approach taken by 
SRM (Floyd et al. 1995) for discriminating between receivers in setting their 
timeout periods based on their individual distances from the source (i.e., 
timeouts are selected from the interval $[C_1 d_i, (C_1 + C_2) d_i]$, where $d_i$ 
is the one-way distance from receiver i to the source). This, in turn, depends 
on the existence of session level messages for the distance estimation process, 
as explained in Section 3. 
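
The distance-based timeout selection of the alternative mechanism can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, and a uniform draw over the interval is assumed, following the SRM approach the text refers to.

```python
import random


def distance_based_timeout(d_i, c1, c2, rng=random):
    """Draw a timeout from the interval [c1 * d_i, (c1 + c2) * d_i].

    d_i is the one-way distance (in seconds) from receiver i to the source.
    A uniform distribution over the interval is an assumption of this sketch.
    """
    return rng.uniform(c1 * d_i, (c1 + c2) * d_i)
```

A receiver close to the source (small d_i) therefore tends to fire earlier than a distant one, which is what requires each receiver to know its own distance.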

Figures 6 through 8 contrast the performance of our proposed feedback 
mechanism, A1, with the alternative feedback mechanism, A2. The comparisons 
are performed in two cases. In the first case, the worst case state receivers were 
distributed at distances in the range [0, RTTmax] (i.e., t=0). In the second 
case, the worst case state receivers were distributed at distances in the range 
[0.2 RTTmax, RTTmax] (i.e., t=0.2). 

Figure 6 shows that the total number of messages sent in response to a probe 
under the new feedback mechanism, A1, is much lower than the total of response 
plus session messages for the alternative feedback mechanism, A2. As discussed in 
Section 3, the session overhead for A2 depends on the session bandwidth; 
we depict the two cases of 1 Mbps and 5 Mbps sessions. For A2, the session 
overhead assumes that an epoch (the time span from sending a probe until 
receiving the last possible reply) takes at most one second. 

Figure 7 shows that messages carrying correct worst case state information 
constitute almost all of the messages sent by the new algorithm A1. In A2, on 
the contrary, almost all the messages sent are overhead 
messages. This demonstrates the robustness of the new feedback mechanism 
and its tolerance to losses in reply messages. 

However, Figure 8 shows that the response time of A2 is lower on 
average. Nevertheless, this is not always the case for A2: a slight shift in 
the distribution of receiver distances reverses the situation and makes the 
response time of A1 lower. This trend continues as t increases. 

From these charts, we conclude that A1 is much more robust than A2. Also, 
the total overhead of A1 is always lower than that of A2 for sessions of up to 
a few thousand participants. However, for very large sessions approaching 10000 
participants, and for certain distributions of receiver distances, the overhead 
of A1 starts to rise significantly. This is true for star topologies, which 
represent worst case performance for A1. For chain topologies, the performance 
of the algorithm was found to be significantly less dependent on the 
value of t. In the next section, we address the issue of enhancing the performance 
of A1 for very large sessions and degenerate receiver distributions. 

Figure 8 Comparing response time for the star topology 



6 ENHANCING THE FEEDBACK MECHANISM 

In this section, we present two enhancements for the feedback mechanism. 
These enhancements further improve the scalability of the feedback mecha- 
nism and reduce its overhead. 



6.1 Adaptive feedback 

In the previous section, it was shown that the proposed feedback mechanism 
needs some tuning to enhance its scalability for very large groups, especially 
when the worst state receivers are far from the sender and, most importantly, 
far from each other. We focus on the worst 
state receivers because the outcome of the simulation experiments, discussed 
in the previous section, shows that almost all the excess replies that are generated 
in these cases are redundant worst case replies. This means that the shift 
in the start time of the timeout periods is still effective in eliminating replies 
from lower state receivers. Thus the parameter C1 does not need tuning. It is 
the parameter C2 which needs to be adapted to support very large groups. In 
other words, once the group size grows large enough, the fixed value of C2 = 4 
no longer suffices to suppress enough redundant replies. To this end 
we developed a simple adaptive algorithm that the sender uses to adapt the 
value of C2 dynamically based on the number of received redundant replies. 
The sender counts the number of redundant worst state replies in response 
to a probe in the variable dups. Note that, based on our previous results, the 
sender can safely count all replies coming in response to a probe, assuming 
they are all worst state replies. Before sending a probe, the sender computes 
a new value for C2 and appends it to the probe message. This value is used 
by the receivers in computing their random timeout periods. The algorithm 
which the sender applies is as follows. 

AvgDups = α AvgDups + (1 - α) dups; 

If AvgDups > THRESHOLD 
    C2 = Min(C2 + 1, MAX_C2); 
Else 
    C2 = Max(C2 - 1, MIN_C2); 

Figure 9 Effect of adaptive feedback 
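
The sender-side adaptation rule above can be written as a small runnable function. This is a sketch under stated assumptions: the function name and the choice to return both values are illustrative, and the default parameters simply mirror the simulation settings reported in this section (minimum 4, maximum 50, threshold 25, smoothing weight 0).

```python
def adapt_c2(c2, avg_dups, dups, alpha=0.0, threshold=25, min_c2=4, max_c2=50):
    """Return the new (c2, avg_dups) pair after one probe.

    dups is the number of redundant replies counted for the last probe.
    alpha weights the history; with alpha = 0 only the last probe matters.
    """
    # Exponentially weighted average of the redundant-reply count.
    avg_dups = alpha * avg_dups + (1 - alpha) * dups
    if avg_dups > threshold:
        # Too many redundant replies: widen the random timeout interval.
        c2 = min(c2 + 1, max_c2)
    else:
        # Few redundant replies: narrow the interval to cut response time.
        c2 = max(c2 - 1, min_c2)
    return c2, avg_dups
```

The sender would call this once per epoch and append the returned C2 to the next probe.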

Figures 9(a) and (b) compare the performance of the static and adaptive 
feedback. In this simulation experiment, MIN_C2, MAX_C2, THRESHOLD, 
and α were set to 4, 50, 25, and 0 respectively. The figures show the ability 
of the simple adaptive algorithm to reduce the number of redundant replies 
drastically, without significant delay in response time. The tradeoff, however, 
is that it takes the sender a longer time before it can declare that the current 
epoch is over and no further replies will be received. Typically, the sender 
sends a new probe only at the end of an epoch, to avoid overlapping replies. 
The sender can always safely terminate an epoch after an amount of time 
equal to $(C_1 f(h) + C_2 g(h) + 2)\,\frac{RTT}{2}$ from sending a probe, where h is the 
highest state received in a reply to the current probe. After sending a probe, 
the sender sets a timer to expire after RTT plus the longest possible timeout 
period in the lowest state, for ending the epoch. As it receives replies, it adjusts 
this timer according to the above equation, which is linearly proportional 
to C2. 

A more aggressive approach for ending an epoch without relying on C2 
would be to terminate the epoch after a period of time equal to RTT from 
the time of receiving the first reply. This aggressive approach safely assumes 
that any reply is coming from the highest state in the group. It attempts to 
give enough time for this reply to propagate to all other receivers and cause 
them to suppress their replies, if they have not already sent them. The approach 
relies on the heuristic assumption that $RTT = RTT_{max}$. 

If it is desired to limit the bandwidth taken by the reply packets to R, then 
the THRESHOLD value can be set as a function of R. A simple approach 
is to set $THRESHOLD = \frac{R}{\text{reply size}} \times \text{epoch duration}$. 



6.2 Passive feedback 

The feedback mechanism, as described, keeps polling the receivers all the time. 
As soon as the sender determines that an epoch has ended, it immediately 
sends the next probe. These probes are important for synchronizing 
the operation of the mechanism and avoiding potential spontaneous chains of 
status change notifications from receivers. However, in situations where the 
states of the receivers are stable for relatively long periods of time, this 
repeated probing is unnecessary. 

One possible solution to optimize the performance of the feedback mecha- 
nism in such cases is to make the sender exploit the flexibility in spacing the 
probes, by increasing the idle time between ending an epoch and sending the 
following probe. However, this approach negatively affects the responsiveness 
of the feedback mechanism, especially when a change in state occurs after a 
relatively long stable state. 
Another solution is to switch the feedback mechanism into passive mode 
whenever these relatively long stable states occur. When the sender gets sim- 
ilar state feedback from n consecutive probes, it sends a probe with a passive 
flag set, and carrying the current highest state h. Receivers do not respond 
to this probe, and the sender enters a passive non-probing mode. If a receiver 
detects that its state has risen above h, it immediately sets a timer in the 
usual way to report its state. On receiving a reported new higher state, each 
receiver updates the value of h. Similarly, if a highest state receiver detects 
that its state has fallen below h, it sets a timer in the usual way. However, 
when the receivers hear a report below h they do not update the value of h 
(as other receivers may be still in the h state). The sender, on receiving this 
report, switches back to the active probing mode, and the same cycle repeats. 
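
The receiver-side rules of the passive mode can be sketched as follows. This is an illustrative sketch, not the authors' code: the class and method names are hypothetical, states are assumed to be comparable integers, and the random timer and reply suppression ("in the usual way") are deliberately left out.

```python
class PassiveReceiver:
    """Receiver behaviour while the sender is in passive (non-probing) mode."""

    def __init__(self, state, h):
        self.state = state   # this receiver's current state
        self.h = h           # highest state announced in the passive probe

    def on_state_change(self, new_state):
        """Return True if a report timer should be set for this change."""
        was_highest = (self.state == self.h)
        self.state = new_state
        rose_above = new_state > self.h
        # Only a receiver that was in the highest state reports a fall below h.
        fell_below = was_highest and new_state < self.h
        return rose_above or fell_below

    def on_report_heard(self, reported_state):
        """Update h only for higher reports; lower reports leave h unchanged,
        since other receivers may still be in state h."""
        if reported_state > self.h:
            self.h = reported_state
```

A receiver below h that moves to another state still below h stays silent, which is what keeps the passive mode free of traffic while the group is stable.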

7 CONCLUSION 

In this paper, we presented a scalable and robust feedback mechanism for 
supporting adaptive multimedia multicast systems. Providing the source of 
a stream with feedback information about the used layers of the stream is 
crucial for the efficient utilization of the available resources. The feedback 
mechanism allows the sender to always send only layers for which interested 
receivers exist, and to suppress unused layers. 

Simulation results showed that the proposed feedback mechanism scales 
well for groups of up to thousands of participants. For typical sessions with 
up to 100 participants (e.g., IRI sessions (Maly et al. 1997)), less than 10% of 
the receivers reply to a probe, in the worst case, while for larger sessions, of 
a few thousands of participants, the reply ratio is below 1.5%. The response 
time was found to be always below the maximum round-trip time from the 
sender to any of the group members. 

The mechanism was shown to be robust in facing network losses, and to 
be more efficient than mechanisms which rely on session level messages for 
estimating individual round-trip times from each receiver to the sender. In 
addition, adaptive enhancements for supporting groups of up to 10,000 par- 
ticipants were proposed and shown to be effective in reducing the number of 
replies without a significant effect on response time. 

Currently, we are incorporating the feedback mechanism in the Quality 
of Session control platform described in (Youssef et al. 1998), for further 
exploration and experimentation. 
Abstract 

We propose a protocol that controls the members of a multicast group that 
periodically send status reports to all members. The protocol, called Multicast 
Access Protocol (MAP), limits the number of concurrent multicast reports as 
the group size becomes large. MAP is a decentralised protocol that provides 
an access control mechanism to an IP multicast group. The protocol supports 
members joining and leaving the group dynamically as well as changes in 
the underlying network topology. MAP is a self-configuring mechanism and 
requires every member to keep only local information, independent of the group 
size, without using random timers. We describe the protocol both formally and 
informally. 
1 INTRODUCTION 

We address the problem of limiting the number of members of a multicast 
group that concurrently send periodic status information to all other members. 
This problem arises in the Scalable Reliable Multicast protocol, where 
each member must report to the whole group the largest sequence number of 
the data packets that it has received from each sender [1]. The periodic sending 
of reports does not scale to large groups, since the number of reports received 
by a member is proportional to the number of group members. Multicast IP, 
as defined in [2], does not control the number of members in a given multicast 
group that can send data simultaneously. One solution to reduce the number 
of received reports is for every member to send its report to a server that 
combines the received reports and sends a combined report to all members. 
However, the server becomes a bottleneck as the group size increases. To overcome 
this limitation, Mark Handley [3] proposed to use multiple servers: each 
server receives the reports from a different subgroup of members and sends a 
combined report to its members and to the other servers. This scheme does 
not cope well with members joining and leaving the multicast group, or with 
changes in the underlying topology, since the servers must be statically 
configured. An adaptive alternative is to use a two-level hierarchy [4] in the 
following way: each member reports to a local representative member, elected 
using random timers, which then sends a combined report to all other local 
representatives. Each representative then sends a combined report to all its 
local members. However, in the absence of hierarchy this approach requires 
that each member maintain a table of distances to all other members in order 
to adjust the random timers. Using a two-level hierarchy, a member must only 
keep a table of distances to all other members having the same representative. 
In this case, each representative is also required to maintain a table of 
distances to all other representatives. 

We propose in this paper a different approach called Multicast Access Protocol 
(MAP) that does not require random timers. A member that uses MAP 
does not keep a distance table to all members but only to a subset of the members. 
Let d(i, j) be the number of hops that a multicast packet sent by member 
i goes through before arriving at member j. We assume that d(i, j) = d(j, i). 
We say that member i is a neighbor of member j if d(i, j) < d(k, j) for all 
members k such that < d(i, j). Thus a member keeps a table of distances 
only to its neighbors. In the worst case a member keeps a table of 
distances to all other members. This worst case occurs only if d(i, j) = d(i, k) 
for all members i, j, k. In general the number of members in the table of distances 
should be independent of the group size. MAP can also be used in a 
two-level hierarchy, but in this paper we restrict the description of MAP to 
the case of a flat hierarchy. 

MAP relies on the notion of a grant as defined in the SMART [5] protocol 
originally designed for ATM. SMART guarantees that, at a given instant, only 
one end-system is concurrently sending data on a given connection of an ATM 
multicast tree. In a similar way, MAP limits the number of concurrent report 
senders to a maximum number n which is fixed initially. The idea of applying 
SMART to IP was suggested to the authors by Jean-Yves Le Boudec [6]. 

To achieve this on multicast IP, MAP builds and maintains n spanning 
trees of the multicast group members, where n is the maximum number of 
reports that can be sent at the same time. All spanning trees of a given group 
are the same except that they can be directed differently. Each spanning tree 
is rooted and directed inward, so that a member needs to keep track of only one 
of its neighbors, called its parent. The member of a spanning tree without a parent 
is called the root. If a member of a spanning tree is a root and it is allowed to send 
a report, then we say the member is a leader. A member and its neighbors 
exchange control information by sending MAP control packets by unicast. 
Members retransmit state information if the protocol requires it and do not 
require a reliable transport protocol. 

The purpose of a MAP control packet is to change the root of a particular 
spanning tree and to ensure that every member becomes root of this tree 
in its turn. When a member becomes root of a spanning tree the protocol 
determines if it is the leader of this tree and is thus allowed to send one 
report. Assuming that no member joins or leaves the group from time t = 0, 
if n(i, t) is the number of times member i sends a report during [0, t], then 
MAP ensures that $|n(i,t) - n(j,t)| \le 1$ for all members i, j, and time t. 

MAP constructs the spanning trees using the reports sent by the members. 
Every host is always free to join or leave the multicast group and the spanning 
trees are reconfigured accordingly. MAP ensures that adjacent members on the 
spanning trees are neighbors. The spanning trees are completely distributed 
and do not rely on any centralised coordinator. The spanning tree is built 
dynamically as members join and leave the multicast group. 

The protocol requires that every member keep the following state information: 
4 bits per spanning tree, and 3 bits per spanning tree and per neighbor. 
MAP uses only one safety timer, which guarantees that the spanning trees are 
loop-free. A MAP control packet carries $2 + \lceil \log n \rceil$ bits, where n is the maximum 
number of reports that can be sent at the same time. In the two following 
sections we assume that n = 1. The paper is organised as follows: Section 2 
gives an overview of the protocol for the case where only one member 
is allowed to send a report at a time. The protocol for this case is then formally 
specified in Section 3. 
Section 4 lists some open issues. 
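
The storage and packet-size figures quoted above can be checked with a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and the logarithm in the control packet size is assumed to be base 2, which the text does not state explicitly.

```python
def control_packet_bits(n):
    """Bits carried by a MAP control packet: 2 + ceil(log2(n)).

    n is the maximum number of concurrent reports (n >= 1). The base-2
    logarithm is an assumption of this sketch.
    """
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return 2 + (n - 1).bit_length()


def member_state_bits(n_trees, n_neighbors):
    """Per-member state: 4 bits per tree plus 3 bits per tree per neighbor."""
    return 4 * n_trees + 3 * n_trees * n_neighbors
```

For n = 1 the packet carries 2 bits, and a member with one spanning tree and three neighbors keeps 13 bits of protocol state in addition to its distance table.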



2 OVERVIEW OF THE PROTOCOL 

This section describes the general mechanisms of the MAP protocol for the 
case where only one report can be sent at the same time. A detailed specifi- 
cation of the protocol is given in Section 3. 

The construction of a spanning tree is shown in Figure 1. Hosts A and B 
simultaneously join a multicast group identified by address mcast. A and B 
start their timer T1. When their timer T1 expires, A and B are both leaders and 
multicast a report, see Figure 1(a). A solid arrow shows a report in transit 
whereas a dashed arrow indicates that the report was received. B receives 
the report of A and determines that A is its parent since B's IP address is 
smaller than A's IP address. The parent-child relationship is shown by a filled 
arrow pointing out of B towards A. A is the leader of the spanning tree A-B, 
see Figure 1(b). B sends a MAP control packet by unicast to A. Meanwhile 
host C joins the multicast group and starts its timer T1. A becomes the child 
of B and sends a MAP control packet to B. B receives the packet from A, 
becomes leader and multicasts a report, see Figure 1(c). Immediately B 
becomes the child of A and sends a MAP control packet to A. C receives the 
report of B. As soon as A receives the MAP packet from B, A becomes leader 
and multicasts a report, see Figure 1(d). C receives the report of A. 

Timer T1 at C expires and C becomes child of A (Figure 1(e)) because 
d(A,C) < d(B,C). 
Figure 1 Construction of Tree A-B-C: (a) A and B multicast report; 
(b) A is leader and parent of B; (c) B is leader and multicasts report, C 
joins the group; (d) A is leader and sends report; (e) C becomes child of 
A because d(A,C) < d(B,C) 



As more hosts join the multicast group, the spanning tree is dynamically 
updated as in the previous example. 

A MAP control packet contains a 2 bit sequence number. 

A member keeps two state variables: 

sn  a sequence number that takes the values 0, 1 and 2, and that initially 
takes the value 0. 

ln  the last value of sn at which the member was a leader. 

The access control to the multicast tree is shown in more detail in Figure 2. 
The spanning tree consists of five hosts A, B, C, D and E. A is parent of B 
and C, and B is parent of E and D. A is root of the tree A-B-C-D-E. Initially 
sn and ln take the value 0, see Figure 2(a). A increments sn modulo 3 since 
no neighbor has its sn equal to 1 and 2, see Figure 2(b). A is now the leader 
since ln = (sn + 1) mod 3, sends a report and sets ln := sn, see Section 3. The 
state values of A are shown in white on a black background as long as A is 
a leader. A sends a control packet with sn to C, see Figure 2(c). This control 
packet is a grant [5]. C receives the grant from A, becomes the root, and sends a 
control packet to A to acknowledge the reception of the grant; the acknowledgement 
carries the same sequence number as the grant. C increments its sn, becomes a leader 
and sends a report, see Figure 2(d). C sends a grant to A since it has no other 
neighbor, see Figure 2(e). A receives the grant, becomes the root and sends an 
acknowledgement to C. A sends a grant to B since B has a sequence number 
equal to (2 + 1) mod 3, see Figure 2(f). B receives the grant, becomes a root, sends 
an acknowledgement to A, increments its sn twice modulo 3 and becomes a 
leader, see Figure 2(g). Similarly D and E become leader and send a report, 
see Figure 2(h) and Figure 2(i). This example shows how the protocol ensures 
that every member sends a report in turn. 

Figure 2 Single Access Control on Tree A-B-C-D-E: (a) D and E join tree, 
A is root; (b) A increments sn, becomes a leader and sends a report; 
(c) A sends grant to C; (d) C is leader, sends report; (e) C sends grant to A; 
(f) A receives grant, A is root, A sends grant to B; (g) B receives grant 
and becomes leader; (h) D is leader; (i) E is leader 
3 SPECIFICATION OF THE PROTOCOL 

This section presents a formal specification of MAP for the case where only 
one report can be sent at a time. 

We recall from Section 2 that a member keeps two state variables: 

sn  a sequence number that takes the values 0, 1 and 2, and that initially 
takes the value 0. 

ln  the last value of sn at which the member was a leader. 

A member also keeps a safety timer T1, which is set to a value that represents the 
maximum delay tolerated by a member to send a report. A second timer T2 is 
used to resend a control packet to a parent if the grant or its acknowledgement 
was lost. In addition, a member keeps a distance table which consists, per 
neighbor, of: 

• the IP address of the neighbor. 

• d(member, neighbor) in number of hops. 

• a parent flag (one bit) that is set if the neighbor is the parent and is reset otherwise. 

• a 2 bit sequence number received in the last MAP control packet from this 
neighbor. 

The four following functions are associated with the distance table. 

• neighbor(x) searches the table, sets the parent flag for the first neighbor 
it finds with a sequence number equal to x and returns true; otherwise it 
returns false. 

• no_neighbor(x) searches the table and returns true if no sequence number 
equal to x was found; otherwise it returns false. 

• reset_parent(x) searches the table; if there is a neighbor with the parent flag 
set and sequence number equal to x, it returns true, finds the neighbor 
with the parent flag set for which d(member, neighbor) is minimum, resets this 
neighbor's flag and deletes all other neighbors with the parent flag set; otherwise 
it returns false. 

• delete_parent searches the table for a neighbor with the parent flag set and 
deletes the neighbor from the table. 
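
Three of the table functions listed above are simple enough to sketch directly. This is a minimal illustration under assumptions: the class and field names are hypothetical, the table is held as a plain list, and reset_parent is omitted because its description combines several steps whose exact order is not spelled out here.

```python
from dataclasses import dataclass, field


@dataclass
class Neighbor:
    address: str          # IP address of the neighbor
    hops: int             # d(member, neighbor) in number of hops
    parent: bool = False  # parent flag (one bit)
    seq: int = 0          # 2 bit sequence number last received


@dataclass
class DistanceTable:
    entries: list = field(default_factory=list)

    def neighbor(self, x):
        """Set the parent flag on the first entry with sequence number x."""
        for entry in self.entries:
            if entry.seq == x:
                entry.parent = True
                return True
        return False

    def no_neighbor(self, x):
        """True if no entry has sequence number x."""
        return all(entry.seq != x for entry in self.entries)

    def delete_parent(self):
        """Delete the first entry whose parent flag is set, if any."""
        for entry in self.entries:
            if entry.parent:
                self.entries.remove(entry)
                return
```

With n = 1 each entry needs only the address, the hop count and the three bits of flag and sequence number, in line with the per-neighbor state quoted in the introduction.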

Each member that implements MAP is modelled as a finite state machine. 
A transition has a label of the form: if (condition) then actions. A transition 
is fireable if the condition is true. Fireable transitions are executed one at a 
time. When a transition is executed all its actions are executed as one atomic 
action. If several transitions are fireable at the same time one is chosen at 
random. The state machine is shown in Figure 3. 
Figure 3 The state machine per spanning tree. Transition labels shown include: 
(Start) if T1 expired then delete_parent; 
if parent((sn + 1) mod 3) then sn := (sn + 1) mod 3; 
if no_neighbor((sn + 1) mod 3) then sn := (sn + 1) mod 3 



State standby. When a new member joins the multicast group, it starts 
timer T1 and enters state standby. If a member receives a report while in state 
standby and its distance table is empty, it inserts the sender as a new neighbor 
with sequence number equal to 0 and the parent flag reset. If the member receives 
a report and the distance table has one entry, and d(member, sender) < 
d(member, neighbor), the neighbor is deleted and the sender is inserted as a 
new neighbor with sequence number equal to 0 and the parent flag reset; otherwise 
the distance table remains the same. When the state is standby and timer T1 
expires, the state changes to leader, and the member multicasts a report only 
if the distance table is empty. 

State leader. While in state leader, if there is no neighbor in the table with 
sequence number equal to (sn + 1) mod 3, sn is incremented modulo 3. If the 
state is leader and there is a neighbor in the table with sequence number equal 
to (sn + 1) mod 3, sn is incremented twice modulo 3, timer T1 is started, 
a control packet with sn is sent to the neighbor, and the state changes to child. 

State child. If the state is child and T1 expires, the parent neighbor is deleted 
from the table and the state changes to standby. While in state child, if a 
report is received and d(member, sender) < d(member, parent), the sender is 
added as a new neighbor in the table with sequence number equal to 0 and 
the parent flag set. If in state child and there is a neighbor with the parent flag 
set and sequence number equal to (sn + 1) mod 3, the member sends a packet to the 
neighbor with sequence number (sn + 1) mod 3 and starts timer T2. The member 
finds the neighbor with the parent flag set and d(member, neighbor) minimum, 
resets this neighbor's flag, deletes all other neighbors with the parent flag set, 
increments sn modulo 3, and changes state to root. If in state child and timer T2 
expires, the member resends a packet with sn to the parent. If in state child and 
a packet is received from the parent with sequence number equal to sn, T2 is stopped. 

State root. While in state root, if there is no neighbor in the table with 
sequence number equal to (sn + 1) mod 3, sn is incremented modulo 3. If the 
state is root and there is a neighbor in the table with sequence number equal 
to (sn + 1) mod 3, sn is incremented twice modulo 3, a control packet with sn 
is sent to the neighbor, and the state changes to child. If the state is root and 
ln = (sn + 1) mod 3, ln := sn and the state changes to leader. 
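
The three rules for state root can be read as guards on a member's local variables. The sketch below only lists which of them are fireable, following the wording of the paragraph above; it does not execute them. The function name and the string labels are hypothetical, and the second state variable (the last value of sn at which the member was a leader) is called `ln` here.

```python
def fireable_root_transitions(sn, ln, neighbor_seqs):
    """List the root-state transitions whose condition currently holds.

    sn and ln are the member's state variables (values 0, 1 or 2);
    neighbor_seqs holds the sequence numbers stored in the distance table.
    When several transitions are fireable, the specification chooses one
    at random; that choice is not modelled in this sketch.
    """
    nxt = (sn + 1) % 3
    fireable = []
    if nxt not in neighbor_seqs:
        fireable.append("increment_sn")        # sn := (sn + 1) mod 3
    if nxt in neighbor_seqs:
        fireable.append("grant_to_neighbor")   # sn += 2 mod 3, go to child
    if ln == nxt:
        fireable.append("become_leader")       # ln := sn, go to leader
    return fireable
```

Keeping the guards separate from the actions mirrors the "if (condition) then actions" form of the transition labels used by the state machine.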



4 OPEN ISSUES 

MAP presents an alternative to the session message mechanism in SRM [4]. 
The two approaches should be compared in a flat hierarchy scenario as well 
as for a two-level hierarchy. It is also not clear which of the two alternatives 
will be more robust to packet losses. Many interesting questions about MAP 
also remain to be answered, such as the overhead for all members to send one 
report each, as well as the associated latency. 



5 CONCLUSIONS 

We have proposed in this paper a protocol that controls the members of a 
multicast group that periodically send status reports to all members. The 
protocol, called Multicast Access Protocol (MAP), limits the number of concurrent 
multicast reports as the group size becomes large. MAP is a decentralised 
protocol that provides an access control mechanism to an IP multicast 
group. The protocol supports members joining and leaving the group 
dynamically as well as changes in the underlying network topology. MAP is 
a self-configuring mechanism and requires every member to keep only local 
information, independent of the group size, without using random timers. Assuming 
that no member joins or leaves the group from time t = 0, if n(i, t) is 
the number of times member i sends a report during [0, t], then MAP ensures 
that $|n(i,t) - n(j,t)| \le 1$ for all members i, j, and time t. MAP was shown to 
be free of deadlocks by extensive simulations using the model checker SPIN 
[7], [8], including in case of message loss, duplication and reordering. An 
implementation was tested successfully in a LAN environment. A complete proof 
of correctness of the protocol is left for further study. 
Abstract 

Recently, some problems related to using the Real-time Control Protocol (RTCP) 
in very large dynamic groups have arisen. Some of these problems are: feedback 
delay, increasing storage state at every member, and ineffective RTCP bandwidth 
usage, especially for receivers that obtain incoming RTCP reports through low 
bandwidth links. In addition, the functionality of some fields (e.g. packet loss 
fraction) in the Receiver Reports (RRs) becomes questionable as, currently, an 
increasing number of real-time adaptive applications are using receiver-based rate 
adaptive schemes instead of rate adaptation schemes based on the sender. 

This paper presents the design of a scalable RTCP (S-RTCP) scheme. S-RTCP is 
based on a hierarchical structure in which members are grouped into local regions. 
For every region, there is an Aggregator (AG) which receives the RRs sent by its 
local members. The AG extracts and summarises important information in the RRs, 
derives some statistics, and sends them to a Manager. The Manager performs 
additional statistical analysis to monitor the transmission quality and to estimate 
regions which are suffering massively from congestion. 

We believe that our S-RTCP alleviates some of the RTCP scalability problems 
encountered in very large dynamic groups and makes effective use of RRs with 
regard to the current changing requirements of real-time adaptive applications in the 
Internet today. 

Keywords 

RTCP scalability, RR, TTL, AG, LAG, AGR, Manager 
1 INTRODUCTION 

Today, the Real-time Transport Protocol (RTP) is widely deployed in most MBone 
applications over the Internet involving multiple senders and receivers. The Real- 
time Control Protocol (RTCP) which is RTP's control protocol is used mainly in 
adaptive applications where the sender changes its rate of data transmission in order 
to suit the current state of the network. 

Some problems have arisen when RTCP has been used in very large dynamic 
multicast groups (Rosenberg, 1997a), (Rosenberg, 1997b), (Schulzrinne, 1997). 
First, because the RTCP reporting interval grows linearly with the group size, a 
feedback delay occurs. Consequently, feedback reports are sent infrequently and 
timely reporting does not occur. Second, each member has to keep track of every other 
member in the group, thus a storage scalability problem can appear. Third, a flood 
of initial RTCP reports multicast to the whole group can occur when a large number 
of members join at the same time. As a result, members can be flooded with these 
reports, especially the ones connected to the network through low bandwidth links, 
and the network may be congested. This problem occurs (Aboba, 1996) if the 
reporting members are not implementing the reconsideration algorithm described in 
(Rosenberg, 1997a). Fourth, for receivers connected through low bandwidth links, 
the RTCP bandwidth available could be used more effectively than is presently the 
case. 
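
The linear growth of the reporting interval mentioned above can be illustrated with a simplified calculation. This is a sketch under stated assumptions: it uses the RTP specification's recommended 5% RTCP bandwidth share and 5 second minimum interval, assumes an average RTCP packet size of 100 bytes, and ignores the sender/receiver bandwidth split and the randomisation of the interval. The function name is hypothetical.

```python
def rtcp_interval(members, session_bw_bps, avg_rtcp_size_bytes=100,
                  rtcp_fraction=0.05, min_interval=5.0):
    """Approximate RTCP reporting interval in seconds.

    The RTCP share of the session bandwidth is divided among all members,
    so the interval grows linearly with the group size once it exceeds
    the minimum interval.
    """
    rtcp_bw = session_bw_bps * rtcp_fraction               # bits per second
    interval = members * avg_rtcp_size_bytes * 8 / rtcp_bw
    return max(min_interval, interval)
```

Under these assumptions, a 128 kbit/s session with 10 members stays at the 5 second minimum, whereas 10,000 members would each report only about once every 1250 seconds, which is the feedback delay problem described in the text.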

Today, in the Internet, some of the requirements for real-time adaptive 
applications are changing. An increasing number of the current multicast 
applications prefer to use receiver-based rate adaptation schemes instead of sender- 
based rate adaptation schemes to adapt to congestion in the network. In sender-based 
rate adaptations, when congestion occurs, the sender decreases its rate of data 
transmission to suit the receiver with the lowest capabilities. Receiver-based 
adaptive applications have the advantage of accommodating the heterogeneous 
capabilities and conflicting bandwidth requirements of different receivers in the same 
multicast group (McCanne, 1996). Also, adaptation is done immediately instead of 
waiting for the sender to adapt. With the appearance of these kinds of receiver-based 
applications, we ask the question: what is the function of the RTCP Receiver 
Reports (RRs) and how can RRs be used effectively in the current Internet? 

We designed a hierarchical scheme which groups members in local regions.
Members in each local region send their RRs locally to an Aggregator (AG) in the
same region. The AG summarises important information in the RRs, derives some
statistics, then sends this information to a Manager. The Manager performs some
monitoring and diagnosis functions to estimate which regions are suffering most
from congestion and to evaluate the quality of the transmitted data.

This paper proceeds as follows. In Section 2, some background information 
about RTP/RTCP functionality is presented. Section 3 presents some of the 
scalability problems of RTCP feedback reports. In Section 4, we describe our 
Scalable RTCP (S-RTCP) scheme. Section 5 presents some of the benefits of using 
S-RTCP. Finally, we summarise the current status and outline future work. The 
present paper expands on the rationale and description given in (El-Marakby, 1998).

2 BACKGROUND

RTP, the Real-time Transport Protocol, is mainly used for real-time transmission
of audio and video over the Internet in multicast and unicast modes (Schulzrinne,
1996). It provides several functions:

• Identification of payload data type to identify the format of the payload data. 

• Sequence numbering to detect data packet loss and out-of-order packets. 

• Timestamping so that data is played out at the right speeds (Rosenberg, 1997a). 

RTCP, RTP's Real Time Control Protocol, is used in monitoring the Quality of 
Service (QoS) of data delivery and in conveying minimal session control 
information to all members in an audio/video RTP session. 

Both RTP and RTCP are integrated within the application, as in the MBone
video and audio applications (e.g. vic and vat).

RTCP has five types of report that are periodically sent to all members in the 
session. The most important are the feedback reports, namely the Sender Report 
(SR), and the Receiver Report (RR). SR and RR differ only in that the SR is issued 
by a receiver which is also a sender whereas the RR is issued by a receiver which is 
not a sender. Both SR and RR contain performance statistics on the total number of 
packets lost since the beginning of transmission, the fraction of packet loss in the 
interval between sending this feedback report and sending the previous one, the 
highest sequence number received, jitter, and other delay measurements to calculate 
the round-trip feedback delay time. The SR provides more statistics summarising 
data transmission from the sender, e.g. timestamps, count of RTP data packets, and 
number of payload octets transmitted. 

The SR and RR have several functions. RRs are used mainly in sender-based 
adaptive applications. The sender can modify its transmissions dynamically based on 
the RR feedback it receives from its receivers. The packet loss parameter in the RRs 
has been used as an indicator of congestion in the network. So, after receiving the 
RRs from the receivers, the sender may increase or decrease its rate of data 
transmission according to the packet loss fraction it received within the current 
interval. This rate adaptation helps to reduce network congestion and adapts to 
changing network conditions (Bolot, 1994), (Busse, 1996), (El-Marakby, 1997a), 
(El-Marakby, 1997b). The SR is useful in lip-synchronisation (inter-media 
synchronisation) and in calculating transmission bit rates. Both SR and RR 
feedback can be used by a third-party monitor which does not receive RTP data but 
only RTCP packets. This monitor can be an Internet Service Provider (ISP) or a 
network administrator. It monitors performance of the network and diagnoses its 
problems (Schulzrinne, 1996). 

The other three types are the Bye report which is used when a member is leaving 
the session, the Application-defined RTCP packet (APP) report which is used for 
experimental use with no official packet type registration, and the Source 
Description (SDES) report which provides identification information about all 
members in the session. 

In the next section, we shall describe some of the RTCP scalability problems.
Then we will discuss the functionality of RRs with respect to current requirements
of real-time adaptive applications in the Internet today.

3 RTCP SCALABILITY PROBLEMS AND FUNCTION OF RRs 

The feedback provided by Internet applications has proved to be useful as no special 
support is needed from the network to detect its current state. The RTCP feedback is 
used in adaptive applications as well as in monitoring. 

RTCP scales well for small multicast groups but a scalability problem arises 
when it comes to a group of thousands of users. Some of these problems are 
addressed in (Rosenberg, 1997a), (Rosenberg, 1997b). 

3.1 RTCP feedback problems in a large dynamic group 

We will explain first how the interval between RTCP packet transmissions is 
calculated. All RTCP reports multicast to all members in the group must not 
consume more than a small fraction (nominally 5%) of the whole bandwidth 
assigned for the session (Schulzrinne, 1996). Hence, every member has to store an 
estimate of the size of the group by counting distinct RTCP reports sent to the 
multicast group. Consequently, members scale back their RTCP reporting interval 
based on the group size they calculated. That is to say, as the group size increases, 
each member increases its reporting interval and as the group size decreases, every 
member decreases its reporting interval. As a result, the bandwidth limit for RTCP 
reports does not exceed 5% of the whole session bandwidth regardless of changes in 
the group size at any time during the session. 
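
As an illustration only, the linear growth of the reporting interval with the group size can be sketched in Python. The function name, its parameters, and the example figures are hypothetical; the sketch ignores the split of RTCP bandwidth between senders and receivers and the randomisation of the interval, and the 5-second floor is an assumption borrowed from the minimum interval recommended for RTP sessions.

```python
def rtcp_interval(members, session_bw_bps, avg_rtcp_bytes,
                  rtcp_fraction=0.05, min_interval_s=5.0):
    """Return a simplified, deterministic RTCP reporting interval in seconds.

    members         -- current estimate of the group size
    session_bw_bps  -- bandwidth assigned to the whole session (bits/s)
    avg_rtcp_bytes  -- average size of one RTCP report (bytes)
    rtcp_fraction   -- share of the session bandwidth reserved for RTCP
    min_interval_s  -- assumed lower bound on the interval
    """
    rtcp_bw_bps = rtcp_fraction * session_bw_bps
    # All members share the RTCP bandwidth, so the time between two reports
    # of the same member grows linearly with the number of members.
    interval = members * avg_rtcp_bytes * 8 / rtcp_bw_bps
    return max(min_interval_s, interval)
```

Under these assumptions, a 1 Mbit/s session with 100-byte reports gives an interval at the 5-second floor for 10 members, but 160 seconds for 10 000 members, which is the feedback delay discussed in the text that follows.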

The following are some of the problems encountered when using RTCP feedback in 
a large dynamic group: 

Feedback delay 

The feedback should be sent periodically within acceptable time intervals. In a large 
RTCP group, this does not happen. Feedback is sent very rarely or not at all. This 
happens because the RTCP reporting interval grows linearly with the group size. 
So, as the group size increases, the RTCP interval increases resulting in infrequent 
RTCP feedback reports which decreases the significance and value of the feedback 
(Rosenberg, 1997a). 

Increasing storage state 

In order to calculate the size of the group, every member has to store a count of 
distinct members it heard from during the session (Rosenberg, 1997a). So as not to 
count duplicate members, the unique Synchronisation Source identifier (SSRC) 
found in the RTP header is stored for every distinct member. Of course, storing all 
the distinct SSRC identifiers for a large group causes a storage scalability problem 
for every member. 

This problem was discussed in (Schulzrinne, 1997) where an SSRC sampling
algorithm is described.

Multicasting RRs to the whole group

RRs are multicast to every member in the session. As mentioned before, RRs are
mainly used by the senders for adaptation. So there seems to be very little benefit
in having each member send its RRs to every other member of the group that is
not a sender. In addition, members connected to low bandwidth links would not want
part of their bandwidth to be used by incoming RRs when this bandwidth usage 
would be of little or no advantage to them. Moreover, the processing load at every 
member may increase because of processing incoming RRs from other receivers. 
Furthermore, if congestion occurs in the network, it is more likely to affect local
members near the congested link. Their RR feedback reports will therefore be more
or less similar, so decreasing the number of redundant RRs that are multicast can
reduce congestion (Aboba, 1996).

Initial feedback flood 

When a very large number of members join simultaneously (e.g. at the start of an
MBone multicast session announcement) (Rosenberg, 1997a), it will not be
possible to get an accurate estimate of the group size. Each member's first
estimation of the group size is 1, and so all the RTCP reports are sent within a 
fixed initial interval. Consequently, congestion can occur in the network and 
especially at low bandwidth links of some members. In addition, the feedback 
reports sent by other members may be dropped due to congestion. This results in 
inaccurate estimation of the group size which depends on counting the reports 
coming from distinct members. Hence, it will take a long time to converge to a 
fairly accurate estimation of the group size and thus to an appropriate RTCP 
interval computation. 

This problem of initial RTCP feedback reports was solved by Rosenberg and 
Schulzrinne (Rosenberg, 1997a), by applying a reconsideration algorithm. Members 
listen to other members in the group before sending their initial feedback reports. 
Consequently, the reporting interval is readjusted before sending the first feedback 
report. 
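
The idea of reconsideration can be sketched as follows. This is a simplified reading of the principle described above, not the published algorithm: when its timer expires, a member recomputes the interval from the group size it has observed so far and sends only if the recomputed deadline has already passed. The function name `reconsider` and the callback `interval_for` are hypothetical, and randomisation of the interval is omitted.

```python
def reconsider(now, last_sent, members_seen, interval_for):
    """Decide at a timer expiry whether to send a report or to wait longer.

    now          -- current time in seconds
    last_sent    -- time of the previous report (or of joining the session)
    members_seen -- distinct members heard from so far
    interval_for -- function mapping a group size to a reporting interval

    Returns (send_now, next_timer).
    """
    interval = interval_for(members_seen)
    due = last_sent + interval
    if due <= now:
        # The recomputed deadline has passed: send and schedule the next report.
        return True, now + interval
    # The group turned out to be larger than first assumed: postpone.
    return False, due
```

A member that joined at time 0 with an initial estimate of one member, and that has heard from thousands of members by the time its first timer expires, would therefore postpone its first report instead of adding to the flood.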

Bye flood 

When an RTP member leaves the group, it multicasts an RTCP Bye packet to the
whole group. The problem occurs if many users leave the group at the same time.
As a result, a flood of Bye packets occurs that may congest the network. The
problem was fixed in (Rosenberg, 1997b) by applying a Bye reconsideration
algorithm.

3.2 Functionality of RRs 

The Internet is a heterogeneous network. Network resources vary throughout,
and users can have different capabilities, specifically link bandwidth. One of the most
important functions of RR feedback reports is their usage in adaptive applications.
By using the packet loss fraction in RRs, the sender can detect network congestion.
Hence, the sender changes its rate of data transmission to adapt to changing network
conditions and to help reduce congestion. This technique has proved to be useful for
unicast applications. However, for multicast applications in the heterogeneous
Internet, the sender ends up decreasing its data transmission rate to suit the receiver
with the lowest capability. Consequently, the sender will not be able to meet the
various bandwidth requirements of different receivers in the same multicast group.

To accommodate this heterogeneous environment and to scale to a very large
number of receivers, a receiver-based rate adaptation scheme is used (McCanne,
1996). The sender sends the data on separate multicast groups. The groups are
ordered such that each provides refinement information over the previous group to
give increased quality. The receivers can subscribe to any number of these
multicast groups, from one to all, according to each one's capabilities and according
to the current state of the network. Hence, receivers adapt to congestion by joining
or leaving multicast groups.
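
A minimal sketch of such a receiver-side decision is given below. It is an illustration under stated assumptions and not the scheme of (McCanne, 1996): the function name and the two loss thresholds `leave_above` and `join_below` are hypothetical, and a real receiver would also need timers and join experiments to avoid oscillation.

```python
def adjust_subscription(layers_joined, loss_fraction, total_layers,
                        leave_above=0.10, join_below=0.02):
    """Return the number of layered multicast groups to stay subscribed to.

    layers_joined -- groups currently joined (the base layer counts as 1)
    loss_fraction -- packet loss observed by this receiver (0.0 to 1.0)
    total_layers  -- number of groups the sender transmits on
    """
    if loss_fraction > leave_above and layers_joined > 1:
        # Congestion: leave the highest refinement layer, keep the base layer.
        return layers_joined - 1
    if loss_fraction < join_below and layers_joined < total_layers:
        # Spare capacity: join the next refinement layer for better quality.
        return layers_joined + 1
    return layers_joined
```

The point of the sketch is that the decision uses only the receiver's own loss measurement, so no feedback has to reach the sender before adaptation takes place.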

Nowadays, in the Internet, we see a great movement towards receiver-based
adaptation schemes. In these applications, the packet loss parameter, which is
the most significant parameter in RRs, becomes of little or no significance to the
sender, as adaptations are performed by the receivers immediately without waiting
for the sender to react.

In the next section, we describe the architecture of our scheme. 

4 OUR SCALABLE RTCP SCHEME 

In this section, we describe how locally scoped regions are formed in a hierarchical
way. Then, we explain in detail the functionality of the Manager.

4.1 Overall view 

RTCP feedback reports are multicast mainly for receivers to calculate the group size 
and thus compute their RTCP reporting interval. In our scheme, the members do 
not need to compute the whole size of the multicast group and RRs are not 
multicast. 




Figure 1 Structure of our scheme showing members in local regions with an
AG (shadowed circle) per region and a Manager (M) at the root of the hierarchy.

Figure 1 depicts the structure of our scheme which organises members
dynamically in a multi-level hierarchy of local regions. Each region has an
aggregator (AG). Local members send their RR feedback reports with limited scope
to reach their own AG, which gathers and aggregates statistics from these reports
and passes them to a Manager. The Manager computes additional statistics to
evaluate the transmission quality and to estimate the regions which suffer from
congestion.

Our scheme makes use of the Time-to-Live (TTL)* field in the IP header to allow 
us to build the multi-level hierarchy with locally scoped regions. We are aware of 
the problems when using TTL scoping with the Distance Vector Multicast Routing 
Protocol (DVMRP) (Meyer, 1997). We chose to use TTL scoping because it is 
simple. 

4.2 Scheme entities

The following are the entities of our Scalable RTCP (S-RTCP) scheme: 

• Member: A member is a sender or a receiver in the same RTP session. 

• LAN Aggregator (LAG): The aggregator for a LAN is also a member which 
represents only local members in the LAN. It aggregates RRs from members in 
its LAN; it then reports to the Manager. 

• Aggregator (AG): The AG is also a member, but it also aggregates RRs from 
members in its local region (i.e. its children); it then reports to the Manager. 
The children of an AG can be normal members, AGs, or LAGs. Every AG has 
a level in the hierarchy. For example, in Figure 1, AG1 is an AG of level 1, 
while AG2 is an AG of level 2 which is a child of AG1. 

• Aggregator Report (AGR): This is a new RTCP report of type AGR. Every 
AG/LAG sends AGRs to the Manager to summarise the quality received by 
local members during different intervals. 

• Manager: This performs some monitoring and diagnosis functions. It receives
AGRs. It is also an AG of level 0 (AG0) as it is at the root of the hierarchy. It
receives RR feedback from its direct children who are neither AGs nor LAGs,
while it receives AGRs from all the other aggregators in the hierarchy. The
Manager should be connected to the network through a high bandwidth link.

The following subsection provides a detailed explanation of the mechanisms of 
our scheme. 

4.3 Scheme description 

When starting the RTP session, two multicast addresses are announced; the first 
address is for the delivery of RTP data packets, while the second one is for 
transporting RTCP control packets. Then, the Manager joins the control multicast 
group. It receives only the RTCP control packets and not the data packets. It is also 
the first AG in the multi-level hierarchy (AG0). Afterwards, senders and receivers
join the two multicast groups for data and control.

* TTL is an integer field in the IP packet header for constraining the travelling
distance of the packet. The source initialises the TTL field with an appropriate
initial value according to the distance it wants the packet to travel. Each router
decrements this TTL by 1 when the packet arrives. The router discards the packet
if the TTL reaches zero.

Figure 2 depicts the multi-level hierarchical structure of local regions. 




Figure 2 Organisation of members in local regions where each local region has 
an AG of a certain level in the multi-level hierarchy. The shaded circles represent 
AGs, while blank ones represent normal children. Also, a LAG (connected to a 
LAN) is shown. 



Selection of Parent AG and formation of local regions 

A new member will perform an expanded ring search (Yavatkar, 1995). The new
member will repeatedly search for a Parent AG by increasing the TTL value until it
finds a nearby Parent. First, it multicasts a "Search_for_Parent" request (see Figure 3)
with a small TTL value greater than 1, as it is a well known convention that a
multicast packet with TTL=1 is sent only to members in the same LAN. If no reply
is received after some time, then it will do another expanded ring search but with a
greater TTL value, and so on, until it receives one or more replies from existing
AGs, one of which will be its future Parent AG.
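
The search loop can be sketched as follows, assuming a hypothetical callback `probe(ttl)` that multicasts one "Search_for_Parent" request with the given TTL, waits for a timeout, and returns the replies received. The starting TTL, the increment, and the upper bound are illustrative values, not values fixed by the scheme.

```python
def expanded_ring_search(probe, initial_ttl=2, step=2, max_ttl=255):
    """Widen the multicast scope until at least one Candidate Parent replies.

    probe -- callable taking a TTL and returning a list of replies
             (empty if nobody answered before the timeout)

    Returns (ttl_used, replies), or (None, []) if no AG was found.
    """
    ttl = initial_ttl  # greater than 1, since TTL=1 stays within the LAN
    while ttl <= max_ttl:
        replies = probe(ttl)
        if replies:
            return ttl, replies
        ttl += step  # no answer: search a larger ring
    return None, []
```

For example, if the nearest AG only becomes reachable at a TTL of 6, the loop probes with TTL values 2, 4 and 6 and stops at the third attempt.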




Figure 3 Messages interchanged between child and Parent during the process of 
searching for a Parent. 

Each Parent AG stores the current number of its direct children, which includes
children acting as AGs and normal children (i.e. not AGs). In addition, each Parent
stores the maximum number of children it is allowed to have, initially obtained
from the Manager. That is to say, when the very first new members join at the
beginning of the session, the only Parent AG at that time is the Manager, which is
AG0. If these new members become AGs, the Manager will pass to them these
numbers and other information. Then, these AGs might become parents and pass
the information they obtained initially from the Manager to their children AGs and
so on.

Upon the receipt of the request "Search_for_Parent", every Candidate Parent AG 
runs the following tests: 

Test 1:
    If this Candidate Parent can afford to get more children (i.e. it currently
    has fewer children than the maximum number stored)
    then
        This new member is considered as a Potential Child;
        Go to Test 2.
    else
        It will not send any reply back to the new member;
        Exit (i.e. Test 2 is skipped).

Test 2:
    If (the level of this Candidate Parent < N) + AND
       (the distance from this new member > certain threshold)
    then
        This member is considered as a Potential Child AG.
    else
        Exit.

The following is a detailed explanation: 

• In Test 2, the Candidate Parent checks whether its level in the hierarchy is less
than N. If its level is N, this means that the new member cannot be an AG.
Consequently, the height of the tree will be limited. We estimate that a suitable
height is N=3. In addition, the Candidate Parent checks if the distance between
the new member and itself is greater than a certain value; if greater, then this
new member can be an AG, otherwise it will remain just a child. This
value should not be small as we do not want the new AG to be very close to its
Parent AG. This value is passed initially by the Manager to AGs of level 1
which pass it to AGs of level 2 and so on.

• Note also, if an AG belongs to that Parent but does not have any children, then
the Parent can replace that old AG with the new one; the old AG returns to being
a normal child. This is not mentioned in Test 2 for simplicity.

• If a LAG wants to join, it will be accepted right away by the Parent AG no 
matter how many children it has. 
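
The two tests and the LAG exception can be expressed compactly in code. The following Python sketch is one possible rendering and not a definitive implementation: the class `CandidateParent`, its field names, and the returned labels are hypothetical, the distance is taken in hops, and neither the replacement of a childless AG nor the Manager's exemption from the child limit is modelled.

```python
from dataclasses import dataclass


@dataclass
class CandidateParent:
    level: int            # level of this AG in the hierarchy
    children: int         # current number of direct children
    max_children: int     # limit obtained initially from the Manager
    min_ag_distance: int  # distance threshold (in hops) for a new child AG


def classify_new_member(parent, distance_hops, n_levels=3, is_lag=False):
    """Return 'none' (no reply), 'child', or 'child_ag' for a new member."""
    if is_lag:
        # A LAG is accepted right away, however many children the Parent has.
        return "child"
    # Test 1: can this Candidate Parent afford another child?
    if parent.children >= parent.max_children:
        return "none"
    # Test 2: may the Potential Child also become an AG?
    if parent.level < n_levels and distance_hops > parent.min_ag_distance:
        return "child_ag"
    return "child"
```

With N=3 and a distance threshold of 4 hops, a level-1 Parent with free capacity would classify a member 6 hops away as a Potential Child AG and a member 2 hops away as a normal Potential Child.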

After performing these tests at the Parent, if this new member is a Potential
Child, then the Parent will send a "Potential_Child_Acceptance" reply. The reply
will contain the Parent's IP address and a TTL value of 255. Furthermore, if this
new member is a Potential Child AG, then the reply will also include the Parent's
level in the hierarchy, the maximum number of children this new Potential AG will
be able to have, the minimum distance allowed between this new member and its
future direct AGs, and the minimum and the maximum thresholds for measuring
packet loss (to be explained later).

+ N is the height of the hierarchical tree. It is equal to the level of the tree + 1.

If there is more than one Candidate Parent for this new member, i.e. it receives
more than one reply, then it will choose as its Parent the one whose reply arrives
with the largest remaining TTL (i.e. the shortest distance from Parent to this new
member). As mentioned before, every Candidate Parent sends its
"Potential_Child_Acceptance" reply with an initial TTL value of 255 so that the
child can compare the remaining TTL values. Note that in the current phase of our
work we are measuring the distance in terms of number of hops only and not delay
time. The new member will store all the values in the
"Potential_Child_Acceptance" reply sent by the selected Parent. If more than one
reply is received from different Parents, the new member will send
"Reject_Parent" to those Parents not selected.
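
The selection rule amounts to taking the maximum over the remaining TTL values, as the following sketch shows. The function names and the example addresses are hypothetical; the only value taken from the scheme is the initial TTL of 255.

```python
INITIAL_TTL = 255  # every Candidate Parent sends its reply with this TTL


def hops(remaining_ttl):
    """Number of hops a reply travelled, derived from its remaining TTL."""
    return INITIAL_TTL - remaining_ttl


def choose_parent(replies):
    """Pick the nearest Candidate Parent.

    replies -- mapping from Parent address to the TTL remaining in its
               "Potential_Child_Acceptance" reply on arrival

    Returns (chosen_parent, rejected_parents).
    """
    if not replies:
        return None, []
    chosen = max(replies, key=replies.get)  # largest remaining TTL = nearest
    rejected = [addr for addr in replies if addr != chosen]
    return chosen, rejected
```

The new member would then unicast its acceptance to `chosen` and send "Reject_Parent" to every address in `rejected`.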

Then, the new member will unicast an "Accept_Parent" reply to the selected Parent.
The Parent will store the remaining value from the original TTL sent by the child
in "Accept_Parent" as an indicator of its distance from that child. Hence, it can
detect the distance to the furthest child. In addition, it will increase the number of its
children by one.

This restriction of the maximum number of children is an attempt to balance the 
load of the members among local regions. The Manager is the only AG that does 
not have this restriction. So if all AGs have their maximum number of children, 
then any new member will be a child of the Manager. In addition, LAGs do not 
have restrictions on the number of members in their LANs. 

Hence, most members may end up in the vicinity of their nearest AG but this is 
not always the case. 

AG leaving or crashed 

Every Parent AG periodically multicasts a local refreshment message to its children
with a TTL equal to the stored TTL of its furthest child. This message shows that the
Parent is still alive. If the Parent wants to leave the group, it will
multicast locally a Bye packet. Whether the Parent crashed or is leaving, every child
will start again the process of searching for a Parent AG through expanded ring
searching.

An exceptional case arises when the child is an AG. As mentioned before, every 
AG can have a maximum number of children. In addition, every AG can accept a 
maximum number of additional children AGs that are not its own children only if 
their Parent crashes or leaves. Hence, if a Parent crashes or leaves the group, then 
every child AG of this Parent will search for the nearest Parent. If the nearest Parent 
can accept more AGs of other Parents, then this child AG will take it as its new 
Parent, otherwise the child AG will expand its ring searching scope to search for 
another Parent. Note that this exceptional test was not mentioned before for 
simplicity. 

Choice of a LAN Aggregator (LAG) 

If one or more members in the same LAN are participating in the same RTP 
session, a LAN Aggregator (LAG) is chosen to aggregate information from all the 
members in the LAN. The process of choosing a LAG depends on scoped multicast 
discovery queries to locate a LAG for the LAN.

When a new member in the LAN joins the session, it will send a query packet
"Search_for_LAG" to search for an existing aggregator for this LAN. If one exists,
the LAG will send a reply "LAG_Exists" that contains its IP address, which is then
stored by the new member.

If no reply is received within some time, then the new member will consider 
itself to be the first member in this LAN for this RTP session and will elect itself 
as the LAG. Then, it starts searching for a Parent AG (see the previous 
subsections). Afterwards, the Parent AG will pass to the new LAG the minimum 
and the maximum thresholds for measuring packet loss. These parameters are used 
when summarising RRs received from members in its LAN. 

LAG crashing or leaving 

The LAG periodically multicasts a "LAG_Exists" refreshment message to other
members in the LAN to inform them that it is alive (Papadopoulos, 1998). If this
message is not received within some time, the LAN members will assume that their
LAG crashed.

If the LAG leaves the group, it will multicast locally an RTCP Bye message. It
will also unicast a notification to its Parent AG that it is leaving.

Whether the LAG crashed or left the group, each member will start the process of
choosing a new LAG for their LAN. Each member will try to multicast locally a
"Want_to_be_LAG" request. Each member will use a randomised back-off timer and
when the timer expires for one of the members, it will immediately multicast
locally a "LAG_Exists" message containing its IP address. Upon receipt of this
message, the other members in the LAN will suppress their "Want_to_be_LAG"
request and accept this member as their new LAG, then store its IP address. This
randomised back-off scheme prevents the flood of "Want_to_be_LAG" requests
that would occur if all members multicast at the same time, and resolves the
problem of choosing a new LAG by directly selecting the member whose timer
expires first.
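
The outcome of the randomised back-off can be simulated in a few lines. This is a sketch under simplifying assumptions: the function name and the `max_backoff_s` bound are hypothetical, propagation delay on the LAN is ignored, and two members drawing exactly the same timer value is not handled.

```python
import random


def elect_lag(members, max_backoff_s=1.0, rng=None):
    """Simulate the back-off election of a new LAG among LAN members.

    members       -- identifiers (e.g. IP addresses) of the members in the LAN
    max_backoff_s -- assumed upper bound of the random back-off timer
    rng           -- optional random.Random instance, for reproducible runs

    Returns (winner, timers), where timers maps each member to its draw.
    """
    rng = rng or random.Random()
    timers = {m: rng.uniform(0.0, max_backoff_s) for m in members}
    # The member whose timer expires first announces itself with "LAG_Exists";
    # all other members then suppress their own requests.
    winner = min(timers, key=timers.get)
    return winner, timers
```

Only one announcement is multicast per election in this model, regardless of how many members are in the LAN.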

SR and Bye RTCP reports 

Our scheme deals mainly with the Receiver Reports (RRs). By limiting their 
travelling distance and summarising important statistics they include, we improve 
the RTCP scalability. The Sender Reports (SRs) will still be multicast periodically 
from the sender to the receivers in the session. Note that SRs will not include 
receiver reporting within them. 

Bye reports are sent as follows: 

• If a child is leaving, it will send a Bye packet to the Parent AG. 

• If an AG/LAG is leaving, it will multicast locally to its children a Bye packet. 

4.4 Contents of an AGR 

Each AG receives the RR feedback from its direct children which are not AGs or
which are AGs with no children. RRs are sent by local members within a certain
time interval that is randomised but does not exceed some fixed amount of time. The
AG organises the information, derives some statistics, and includes them in an
AGR which reports the quality received by the receivers within a certain interval.
Note that the statistics are computed from the RRs of every receiver to a specific
sender in the multicast data group.

The QoS statistics and other information contained in an AGR are described 
below. Some of the functionality of these statistics is explained in the following 
subsection. The statistics are: 

• Number of children that this Parent AG is summarising in the current interval. 
This number includes only children that send RR feedback. 

• Number of children, within the current interval and from the beginning of the 
data transmission, whose: 

• packet loss exceeds the maximum threshold; 

• packet loss lies between the minimum and the maximum thresholds. 

• IP address of the Parent of this AG. 

• IP address of the child receiving the worst quality (i.e. which has the highest 
packet loss in this interval) and the value of packet loss incurred. 

• Average, median, and standard deviation of packet loss, in the current interval 
and since starting transmission, that: 

• surpasses the maximum threshold; 

• lies between the minimum and the maximum thresholds. Note that in order 
to calculate the median, the AG will sort the packet loss according to the 
maximum packet loss incurred by every child. 
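
The per-interval part of these statistics can be sketched with Python's standard `statistics` module. The function name, the dictionary keys, and the example values are hypothetical, and two choices are assumptions of this sketch rather than of the scheme: the population standard deviation is used, and a loss value equal to a threshold is counted in the middle band. The long-term (since start of transmission) figures and the Parent's IP address are omitted.

```python
from statistics import mean, median, pstdev


def summarise_interval(losses, min_threshold, max_threshold):
    """Summarise one interval of RRs for one sender.

    losses -- mapping from child address to its packet loss fraction
    """
    def stats(values):
        if not values:
            return None
        return {"count": len(values), "mean": mean(values),
                "median": median(values), "stdev": pstdev(values)}

    above = [v for v in losses.values() if v > max_threshold]
    between = [v for v in losses.values()
               if min_threshold <= v <= max_threshold]
    worst = max(losses, key=losses.get) if losses else None
    return {
        "children": len(losses),          # children that sent RR feedback
        "above_max": stats(above),        # loss exceeding the maximum threshold
        "between": stats(between),        # loss between the two thresholds
        "worst_child": worst,             # child receiving the worst quality
        "worst_loss": losses[worst] if worst is not None else None,
    }
```

For instance, with thresholds of 0.02 and 0.10 and four children reporting losses of 0.01, 0.05, 0.07 and 0.30, two children fall in the middle band and one exceeds the maximum threshold.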

Once all these measurements are computed, they are included in an AGR. Note 
that these measurements are to evaluate the quality received from one sender. If 
another sender exists during this interval, then the same measurements are calculated 
from RRs of local receivers of this other sender. Then, the measurements are 
appended to the AGR but related to the other sender. 

Then, the AG unicasts the AGR to the Manager. In case the aggregator is a 
LAG, it will send its AGR directly to the Manager too. In the current phase of our 
work, the AGR is sent directly to the Manager and not to the Parent AG that can 
pass it to its Parent AG and so on until it reaches the Manager. This is because we 
do not want to have a long feedback delay until the AGR reaches the Manager. 

The following subsection describes the functionality of the Manager. 

4.5 Functionality of the Manager 

The Manager monitors the data distribution in the multicast group and performs 
some diagnosis functions. It collects and parses the information received from the 
AGRs during every interval. Then, it logs useful statistics. By making use of 
information in AGRs, it can estimate whether problems are specific to a certain 
region or several regions or to all regions of the whole multicast group. 

By obtaining the number of children whose packet loss exceeds the maximum
threshold as well as the total number of children in the region, the percentage of
children suffering from packet loss above the maximum threshold is derived. As a
result, the Manager can pinpoint the regions which are suffering severely from high
packet loss. This percentage can turn out to be the same as in another region.
However, the case of one member out of a total of 5 in a region exceeding the
maximum threshold is different from the case of 100 members out of 500 doing so
in another region.
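
One way the Manager might take both the percentage and the absolute count into account is sketched below. The function names and region labels are hypothetical, and breaking ties between equal percentages by the absolute number of affected children is an assumption of this sketch; the scheme itself only observes that the two cases differ.

```python
def congestion_share(above_max, children):
    """Fraction of a region's reporting children above the maximum threshold."""
    return above_max / children if children else 0.0


def rank_regions(reports):
    """Order regions from most to least affected.

    reports -- mapping from region (e.g. AG address) to a pair
               (children above maximum threshold, children reporting)
    """
    return sorted(
        reports,
        # Primary key: share of affected children; secondary key: how many.
        key=lambda r: (congestion_share(*reports[r]), reports[r][0]),
        reverse=True,
    )
```

With this ordering, a region with 100 affected members out of 500 is listed before a region with 1 affected member out of 5, although both show the same percentage.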

Moreover, the mean, the median, and the standard deviation of packet loss that 
lies between minimum and maximum threshold and that is incurred by all local 
members, can be used by the Manager. The Manager computes the distribution of 
packet loss and hence can detect whether packet loss from most members in this 
range lies nearer to the minimum threshold or nearer to the maximum threshold or 
in between. The same derivations apply to packet loss greater than maximum 
threshold. 

Every AGR contains the IP address of the Parent of the AG which is sending this 
AGR. Consequently, the Manager can trace back congestion and detect if it is also 
spreading in other neighbouring regions (i.e. region of the Parent) or if it is only 
limited to the current region. 

The IP address of the receiver that is suffering from the maximum packet loss 
might be used by the Manager to launch an mtrace between the sender and the 
receiver to diagnose network problems along the multicast distribution tree and to 
detect hops showing a significant amount of losses (Thaler, 1997). 

In some cases, if some applications still insist on using sender-based adaptive
schemes, then S-RTCP can be adapted so that the Manager sends to the sender the
packet loss value of the receiver with the highest packet loss in the current
interval. The sender may then decrease its rate of data transmission if
necessary.

By storing statistics about several consecutive intervals, it can be detected 
whether the network performance is improving or not. 

The estimations mentioned above are derived from short-term statistics (i.e. 
statistics within an interval). Moreover, similar analysis can be performed on long 
term statistics to evaluate the quality of the distribution of data during the whole 
session. 

In addition, the statistical data which is gathered and analysed can be used by an 
Internet Service Provider (ISP), a network administrator, or a technician to estimate 
the quality received by each region during intervals and during the whole 
transmission. Furthermore, the ISP can detect the popularity of individual sessions 
and derive a rough estimate of regions which were densely populated during the 
whole period of transmission. 

The next section presents the benefits of using our scheme. 

5 BENEFITS OF OUR SCHEME 

The following are the advantages we claim of using our scheme in large RTCP 
groups: 

• Resolving the storage scalability problem: Members do not store state about
every distinct member in the group because they do not need to know the group
size.

• Timely reporting of feedback reports: Feedback reports become more useful
because the RTCP reporting interval does not depend on the group size so 
feedback delay is minimised. Hence, the experience of members during short 
intervals of the whole transmission is accurately reported. 

• Effective use of the bandwidth: Using our scheme, the number of incoming
RRs to every member in the group is decreased and there is limited travelling of 
RRs. This is because of the formation of local regions where RRs are not 
multicast but are sent with limited scope and not global scope. 

• Decrease in the number of redundant reports: Even though every region can
still have redundant RRs sent to its AG, the total number of redundant RRs,
which used to be multicast, is decreased. In addition, measurements in RRs are
aggregated into AGRs summarising the quality of the received data.

• Useful statistics to be used in network diagnosis and in charging: The 
aggregated statistics received by the Manager can help a network administrator 
to diagnose problems in the network. In addition, these QoS measurements can 
help an Internet Service Provider (ISP) to estimate the quality received in 
certain regions and the total number of members in the group can show the 
popularity of individual sessions. 



6 SUMMARY AND FUTURE WORK 

We have presented the problems encountered in the deployment of RTCP in large 
dynamic groups. We have also discussed the functionality of RTCP Receiver Reports
(RRs) in the current Internet, where many adaptive applications use receiver-based
rate-adaptive schemes instead of schemes based on sender adaptations. We have
designed a Scalable RTCP (S-RTCP) scheme in which members are organised 
dynamically in local regions; every region has an Aggregator (AG) that receives 
RRs locally from its members, extracts useful information, derives some statistics, 
then sends this information to a Manager. The Manager monitors the quality of the 
data distribution and performs some statistical analysis to estimate which regions 
are suffering from congestion. We believe our scheme reduces some of the RTCP 
scalability problems encountered in large groups, namely feedback delay, increase in 
storage state, and ineffective use of the RTCP bandwidth especially for receivers 
connected through low bandwidth links. In addition, our scheme directs important 
information included in RRs to an entity that can make valuable use of them. 

In the next phase of our work, we are simulating S-RTCP using the network
simulator (NS) (McCanne, 1998) and we will report the results in due course. Also,
we intend to investigate further functions that the Manager can perform, analyse
the limitations of our design, and try to refine it.
Abstract

In this paper the utilisation of the Probability of Congestion (PC) as a bandwidth 
allocation decision parameter is presented. We assume short buffers at the switch
nodes to cope with cell-level multiplexing contention (“bufferless”
environments). Therefore, delay and cell delay variations are strongly bounded.
Moreover, the Cell Loss Ratio (CLR) becomes the critical performance parameter. 
The validity of PC utilisation is compared with quality of service parameters in 
bufferless environments. The convolution algorithm is an accurate approach for 
Connection Admission Control (CAC) in ATM networks with small buffers. 
However, the convolution approach has a considerable computation cost, in terms 
of calculation and memory. To overcome these drawbacks, a new method of 
evaluation is proposed and analysed: the Enhanced Convolution Approach (ECA). 
In complex scenarios, with ECA, PC calculation can be carried out in real time 
while maintaining the desired accuracy. Several experiments have been carried out 
to compare the demanded bandwidth evaluated by analytical methods, simulations
and measurements in actual ATM switches. The main contribution of this paper is
the proposal and analysis of ECA for PC evaluation in CAC schemes.

Keywords 

Connection Admission Control, Traffic Management, B-ISDN and ATM 

1 INTRODUCTION 

The Asynchronous Transfer Mode (ATM) transport network is based on fast 
packet switching using small fixed-size packets called cells. ATM permits flexible 
bandwidth allocation, so an important objective is to obtain the maximum 
statistical gain on a shared resource: the physical link. However, the bursty nature 
of the ATM traffic imposes strict requirements for traffic control. This paper is
focused on the utilisation of Connection Admission Control as a preventive control
method for real-time services.

Call Admission Control (CAC) is a procedure responsible for determining whether 
a connection request is admitted or denied. The procedure is based on resource 
allocation schemes applied to each link and switching unit. CAC schemes may be
classified as non-statistical allocation (peak allocation) and statistical allocation;
this paper relates to the second case. In statistical allocation, bandwidth for a new
connection is not allocated on the basis of peak rate; rather the allocated bandwidth 
is less than the peak rate of the source (the sum of peak rates may be greater than 
the capacity of the output link). The determination of a simple and efficient CAC 
policy is one of the major challenges in the design and implementation of an ATM- 
based B-ISDN. 

The maximum statistical multiplexing gain can be achieved if the network knows 
the probability distribution density function of the individual sources. The network 
needs a complete characterisation of sources with a known behaviour in statistical 
terms. A set of standardised parameters describes the behaviour of VBR
connections in statistical terms. These parameters are: Peak Cell Rate (PCR),
Sustainable Cell Rate (SCR) and Burst Tolerance (BT). The VBR bearer capability
can be partitioned into two types: a) real-time (rt-VBR), which requires tightly
constrained delay and delay variation (as in interactive voice and video
applications), and b) non-real-time (nrt-VBR), where only a maximum cell transfer
delay is considered (e.g. data transmissions with guaranteed QOS). This paper is
mainly focused on CAC aspects relating to rt-VBR traffic management.

Adequate traffic characterisation is required to properly design and operate the
ATM network, but the wide range of possible future services makes this task very
complex (Kleinewillinghöfer, 1991). Inevitably, any characterisation of traffic
must be in terms of the specific times at which cells are generated by the traffic 
source. 

2 BANDWIDTH ALLOCATION IN ATM NETWORKS 
2.1 Previous Work 

The exact evaluation of the connections that can be admitted onto a link, maximising
the statistical multiplexing gain with guaranteed QOS, is a difficult aspect of ATM
network management (Castelli, 1991), (Hui, 1988), (Bolla, 1997) and (Ohta,
1992). This is due to the dependence of the QOS parameters on the assigned
bandwidth and the buffer size in a link.

J. Hui proposed a multilevel congestion and control model mechanism in (Hui, 
1988). This model defines three different levels: the cell (packet), burst and call 
level. Those levels are based on the behaviour of the integrated traffic. Different 
statistical parameters are required to define the traffic at each level. 

The QOS parameter CLR can be analysed at both the cell level (CLRc) and the burst
level (CLRb). CLRb is the dominant factor for large buffers and CLRc is the
dominant factor for small buffers (Castelli, 1991), (Handel, 1994) and (Miyao,
1993). It is very interesting to analyse environments with buffers large enough to
make CLRc negligible, but small enough for CLRb to be approximated closely by a
bufferless model. This aspect is detailed further in this work.

It is possible to assign an equivalent bandwidth (effective bandwidth for some 
authors) to each source that reflects its characteristics. The notion of “effective” 
bandwidth for each connection aims to summarise in a single parameter the 
bandwidth and QOS requirements of a connection. At the burst level, two different 
approaches for equivalent bandwidth evaluation are studied by (Guerin, 1991) and 
(Gallasi, 1989), in which different aspects of the behaviour of multiplexed 
connections are considered and fluid-flow model and stationary bit-rate 
distributions are presented. The fluid-flow model is also studied by (Castelli, 
1991). 

The fluid-flow model estimates the equivalent bandwidth when the individual 
impact of connections is critical. This model does not consider any multiplexing 
aspect. The fluid-flow model assumes that the information arrives uniformly during 
a burst and that the server removes the information from the queue in the same 
manner. Yang and Tsang (Yang, 1995) describe an approach to estimate the cell
loss probability for traffic scenarios with identical traffic sources (homogeneous
traffic).

Stationary approximations: In this case the effect of statistical multiplexing is the 
dominant factor, and it considers that cells are lost when the instantaneous rate is 
greater than the bandwidth provided by the link. Small buffers are not effective at 
the burst level. Three methods are introduced below. 

Binomial: the distribution of the aggregate bit rate on a link can be determined 
from the stationary distribution of the Markov chain formed by the superposition of 
sources. 

Gaussian: this scheme (also referred to as both the normal approximation and the 
two-moment allocation scheme) assumes the independence of the traffic behaviour 
of the connections and characterises the multiplexed traffic by a normal 
distribution. The sum of the means and the sum of the variances of each connection 
give these parameters. A connection is only accepted if the congestion probability 
derived from the tail of the normal distribution is less than a pre-specified 
threshold. The Gaussian assumption is not applicable when there are small 
numbers of very bursty connections with high peak rates, low utilisation, and long 
burst periods. 
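A minimal sketch of the Gaussian admission test described above is given below. It
assumes independent connections, each summarised by a (mean, variance) pair; the
function names and the on-off moment helper are illustrative choices made for this
sketch and are not taken from the cited works.

```python
import math


def on_off_moments(peak, utilisation):
    """Mean and variance of an on-off source that emits `peak` with probability `utilisation`."""
    return peak * utilisation, peak ** 2 * utilisation * (1 - utilisation)


def gaussian_congestion_probability(sources, capacity):
    """Tail of the normal distribution whose mean and variance are the sums over all sources."""
    mean = sum(m for m, _ in sources)
    var = sum(v for _, v in sources)
    if var == 0:
        return 1.0 if mean > capacity else 0.0
    z = (capacity - mean) / math.sqrt(var)
    # Upper tail Q(z) of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def admit(sources, new_source, capacity, threshold):
    """Accept the new connection only if the estimated congestion probability stays below the threshold."""
    return gaussian_congestion_probability(sources + [new_source], capacity) < threshold
```

Only two numbers per connection are needed, which is why the method is cheap; the
price is the loss of accuracy for small numbers of very bursty connections noted above.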
Convolution: the exact distribution of the aggregate bit rate can be determined by
convolution using the exact bandwidth requirements of each traffic type. This
method is based on the formula:

$$P(Y + X = b) = \sum_{k=0}^{b} P(Y = b-k)\,P(X = k) \qquad (1)$$

X and Y refer to the bandwidth requirement of the new connection and of the
already established connections respectively, and b denotes the instantaneous
required bandwidth. The above expression allows the evaluation of the distribution
function of the demanded bandwidth on a link; this method is explained in detail in
section 4.1.
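The convolution formula can be sketched in a few lines of Python. The sketch assumes
that each distribution is held as a dictionary mapping a rate to its probability and
that rates are integers in a common unit, so that equal rates can be merged exactly;
this representation is an assumption of the example.

```python
from collections import defaultdict


def convolve(dist_y, dist_x):
    """Distribution of Y + X for independent Y and X, each given as {rate: probability}."""
    result = defaultdict(float)
    for rate_y, p_y in dist_y.items():
        for rate_x, p_x in dist_x.items():
            # Rates add, probabilities multiply; equal rates are accumulated.
            result[rate_y + rate_x] += p_y * p_x
    return dict(result)


# Two two-state sources emitting 2 or 10 Mbit/s with probabilities 0.625 and 0.375.
source = {2: 0.625, 10: 0.375}
aggregate = convolve(source, source)
```

Here `aggregate` holds three rates (4, 12 and 20) whose probabilities sum to one.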

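The decision criterion for accepting a new connection X when convolution is used in
CAC is based on the Probability of Congestion (PC):

$$PC(Y + X) = P(Y + X > C) = \sum_{b > C} P(Y + X = b) < \varepsilon \qquad (2)$$

where C is the capacity of the link and $\varepsilon$ is the pre-set maximum PC.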

Heuristic Methods: the other group of CAC approaches is based on heuristics and 
data modelling techniques (Saito, 1992). The neural network and fuzzy logic based
approaches are examples of this kind. Heuristic approaches provide a
mechanism for clustering data obtained from ATM traffic measurements in a 
structure that constitutes the traffic model. A common example is a net structure 
composed of a set of neurones and respective connections for neural nets and a rule 
structure composed of a set of “if-then” associations of variables in the case of 
fuzzy systems (Ramalho, 1994). 

2.2 Drawbacks 

Several limitations have been found in the previous studies:

Inter-dependencies: all the models studied describe the behaviour of the sources
without considering their interactions inside the network. The feasibility of
performance objectives in ATM networks with correlated traffic is also studied in
(Hee, 1993).

Heterogeneous environments: normally, the evaluation approximations lose accuracy
in heterogeneous scenarios. In these environments there is a trade-off between the
Integration (Complete Sharing) and Segregation (Complete Partitioning) approaches.

Calculation effort and accuracy: accurate evaluations have been simplified in order
to reduce the complexity of calculations and the required memory; consequently,
accuracy is reduced. In (Guerin, 1991), the convolution approach is first applied
solely as a binomial distribution over homogeneous sources. Later, the Gaussian
distribution is proposed as an approximation to the exact value.

Cell Loss Probability evaluation (individual CLR): normally, different classes of
traffic are segregated to different VPs. Therefore, all individual connections have
the same QOS. Nevertheless, it may be more efficient to transport different classes
of traffic on the same VP; connections on the same VP then have different CLRs and,
thus, different QOS for different classes. This individual CLRi for each class of
traffic is difficult, or impossible, to obtain.
2.3 Experiments of Bandwidth Allocation based on analysis

In the next stage of the work, experiments have been carried out in order to
compare the behaviour of different bandwidth allocation approaches for a set of
scenarios. The Linear, Fluid-Flow and Gaussian approximations are contrasted with
the Convolution Approach. They have been evaluated in the following manner:

• Linear method: the maximum number of sources that the link can transport, Nj,
is evaluated by an exact method. The effective bandwidth is C/Nj for the j-type
sources. A similar procedure is applied to the remaining types. In this case, it
is necessary to derive the value off-line; this may be achieved either through
analysis or through simulation and experimentation.

• Fluid-Flow model: this approximation has good accuracy when either the 
number of connections is small or the actual total equivalent capacity is 
reasonably close to the overall mean rate. An approximation presented in 
(Guerin, 1991) is used. 

• Gaussian allocation approach: this method is based on the assumption that the 
distribution of the required bandwidth of the existing calls can be 
approximated by a normal distribution with the same mean and variance. That 
allows the use of standard approximations to estimate the tail of the bit rate 
distribution. Formula (6) has been used for this approach. 

• Convolution Approach. An exact evaluation of the instantaneous rate on the 
link is obtained. Utilising the bufferless assumption, Convolution is used for 
Bandwidth Allocation. In section 4, a detailed study is presented. 

All methods calculate a demanded bandwidth in order to ensure a pre-set upper 
bound for cell loss ratio. 
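The linear method listed above can be illustrated for homogeneous on-off sources. In
this sketch the "exact method" is taken to be the binomial distribution of the number
of active sources, which is an assumption made for the example; the function names
are likewise illustrative.

```python
import math


def binomial_pc(n, peak, p_on, capacity):
    """P(aggregate rate > capacity) for n independent on-off sources with the given peak rate."""
    return sum(
        math.comb(n, k) * p_on ** k * (1 - p_on) ** (n - k)
        for k in range(n + 1)
        if k * peak > capacity
    )


def max_sources(peak, p_on, capacity, max_pc):
    """Largest N such that the congestion probability does not exceed max_pc."""
    if not 0 < p_on <= 1:
        raise ValueError("p_on must be in (0, 1] so that the search terminates")
    n = 0
    while binomial_pc(n + 1, peak, p_on, capacity) <= max_pc:
        n += 1
    return n


def linear_effective_bandwidth(peak, p_on, capacity, max_pc):
    """Effective bandwidth C / N_j assigned to every source of this type."""
    n_max = max_sources(peak, p_on, capacity, max_pc)
    if n_max == 0:
        raise ValueError("not even one source fits on the link")
    return capacity / n_max
```

Once the per-type value has been derived off-line, admission reduces to adding the
effective bandwidths of the connections and comparing the sum with C.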

Homogeneous traffic experiments: for these experiments a set of 50 On-Off 
sources has been analysed. These sources have a mean burst period equal to 100 
ms; and the maximum PC allowed is 10 s . 

Fig. 1 shows the demanded bandwidth (y-axis) evaluated by both Fluid-Flow and
Convolution. The number of sources is 50, and each connection has a peak rate
equal to 4 Mb/s. The utilisation of each source varies from 10% to 80% (x-axis).
The fluid-flow model is evaluated for different buffer sizes (b = 0.01, 1, 2 and 3
Mbit); b = 0.01 in fact represents small buffers, and no differences have been
obtained for buffer sizes up to 128 cells.
Fig. 1 Demanded Bandwidth vs. utilisation

For source utilisation higher than 25% and with a buffer size of 3 Mbit, the required
capacity evaluated by convolution is greater than that required by the Fluid-Flow
approximation. Furthermore, when the size of the buffer is 1 Mbit, the same effect
occurs when the source utilisation is higher than 55%. When using small buffers, 
the convolution approach always gives more accurate results. Note that small 
buffers are used in our study to limit maximum cell delay and jitter. Consequently, 
Fluid-Flow approximations will not be taken into account in the following 
experiments. 

On the other hand, the convolution approach always evaluates a more accurate
demanded bandwidth than the Gaussian model. For 80% source utilisation, the
Gaussian model over-estimates the bandwidth by 10% to 40% compared to
Convolution.

Heterogeneous traffic experiments. Several experiments with mixed traffic are now 
presented. On the y-axis the demanded bandwidth is shown and on the x-axis all 
the possible combinations of traffic are presented. The source characterisation has 
been chosen in the GMDP model. The following table shows the characteristics of 
the used sources. 



            State 0                  State 1
Type   Rate (Mbit/s)   Prob     Rate (Mbit/s)   Prob
A2     0.4             0.625    2               0.375
B2     2               0.625    10              0.375
C2     6               0.625    30              0.375

Table 1 Source description

Two types, B2 and C2, are mixed first; the maximum number of connections is 7 and
42 respectively. In this scenario the Gaussian approximation is also accurate,
except for combinations of traffic that correspond to a small number of
connections.

In the following experiment, three types are mixed: A2, B2 and C2. The maximum
number of connections is 15, 10 and 3 respectively.




Fig. 2 Demanded Bandwidth for heterogeneous traffic 

For simplicity, only combinations with one C2 connection and varying numbers of B2
and A2 connections are plotted in Fig. 2, since the remaining combinations show
similar behaviour. The linear approximation has a changing behaviour that tends to
be conservative. Differences of up to 50% can be observed when the number of
connections is low.

Using small buffers, a premise of this study, the convolution approach always gives
more accurate results than Fluid-Flow approximations. The required capacity
evaluated by convolution is normally less than that required under the Gaussian
assumption. In the presence of bursty traffic (utilisation below 20%), however, the
Gaussian evaluation is too optimistic. Guerin et al. (Guerin, 1991) propose using
Fluid-Flow approximations in this situation. Moreover, for small buffers the
required bandwidth is perceptibly higher than the Gaussian estimate, and it is not
clear how to delimit this situation. This optimism is unwanted because the actual
cell loss may be greater than the evaluated cell loss ratio and, consequently, the
QOS requirements established for the user cannot be guaranteed. Finally, the linear
approximation varies from pessimistic to optimistic depending on the mixture of
traffic.

3 THE PC AS BW ALLOCATION DECISION PARAMETER
3.1 ATM network model 

There are some services for which the QOS has real-time delay constraints (i.e.
interactive services); these services are considered in the standards as rt-VBR
bearer capabilities. Therefore, very large buffers cannot be introduced, and
buffer dimensioning is carried out taking into account cell-level contention.
Also, suitable buffer sizes can be selected to ensure that the maximum cell delay is
less than a pre-specified limit (Yang, 1993). Under that premise, Cell Loss Ratio
(CLR) is the most relevant QOS parameter.
Therefore, the buffer size of the statistical multiplexer is assumed to be small, in
order to guarantee an acceptable maximum delay (e.g. 50 cells correspond to about
140 µs at a link rate of 150 Mbit/s). LAN interconnect is an important example of a
service which requires both a low loss rate and a low end-to-end delay (Wright,
1989), mainly because of time-outs in the LAN protocols.
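The delay figure for a 50-cell buffer can be checked with a short calculation. The
sketch assumes the standard 53-byte ATM cell and ignores any overhead below the cell
level; the function name is illustrative.

```python
CELL_BITS = 53 * 8  # an ATM cell is 53 bytes long


def max_queueing_delay_us(buffer_cells, link_rate_bit_s):
    """Time in microseconds needed to drain a full buffer at the given link rate."""
    return buffer_cells * CELL_BITS / link_rate_bit_s * 1e6


delay = max_queueing_delay_us(50, 150e6)  # roughly 141 microseconds
```

The same function shows how the worst-case queueing delay scales linearly with the
buffer size chosen for the multiplexer.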

3.2 The Probability of Congestion

To work with small buffers implies that the traffic bursts cannot always be saved in 
the buffer. Therefore, the burst length is irrelevant because the majority of cells 
will be lost. On the other hand, it is not likely that users will be able to supply 
information about the burst length at connection set-up. 




Fig. 3 Congestion in an ATM link



W and W' represent the offered and the carried traffic respectively. Under the
bufferless assumption, W - W' is the lost traffic. Fig. 3 (a) shows the
instantaneous aggregated rate of all connected sources against time. Fig. 3 (b)
shows the probability associated with a given instantaneous aggregated bit rate of
all sources; all situations corresponding to rates greater than C (to the right of
C) are in a congestion state.

The probability of congestion (PC) is the sum of the probabilities corresponding to
rates greater than C, which is the shaded area. The probability of congestion
states neither how many cells are lost nor the duration of the congestion state, but
only that there is cell loss (Iversen, 1991). The admissible load region is
approximated using parameters such as the mean load, the congestion probability,
and the ratio of cells exceeding the link capacity for the total cell stream and for
each individual connection. For arbitrary mixes, the load of the link may provide
only little information about the cell loss probabilities. Cell losses are quite
likely if the bandwidth required at the burst level exceeds the capacity of the
link. These events are taken into account by the PC.

$$PC(Y) = P(Y > C) = \sum_{R > C} P(Y = R) \qquad (3)$$

Y denotes the bandwidth distribution in terms of instantaneous rate, and C is the
rate of the link. However, unlike the Cell Loss Ratio (CLR), the PC does not give
any information about the number of cells lost in case of congestion. In a short
congestion state, all cells may be buffered with no cell losses occurring.
Nevertheless, when a burst's duration is longer than the buffer can absorb, almost
all cells exceeding the link capacity are lost. In this case, the relation between
PC and CLR is approximated by:



$$CLR(Y) = \frac{\sum_{R > C} (R - C)\, P(Y = R)}{E(Y)} \qquad (4)$$



If the buffer size is sufficient for cell contention, the evaluated CLR provides an 
upper bound to the total cell loss probabilities. The PC model is a stationary 
approximation, in other words, a probabilistic scheme. 
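Expressions (3) and (4) translate directly into code once the distribution of the
instantaneous rate is available. The sketch below assumes the distribution is a
dictionary mapping each rate to its probability; the representation and the function
names are choices made for this example.

```python
def probability_of_congestion(dist, capacity):
    """Expression (3): total probability of the rates that exceed the link capacity."""
    return sum(p for rate, p in dist.items() if rate > capacity)


def cell_loss_ratio(dist, capacity):
    """Expression (4): mean excess rate divided by the mean offered rate."""
    mean_rate = sum(rate * p for rate, p in dist.items())
    if mean_rate == 0:
        return 0.0
    lost = sum((rate - capacity) * p for rate, p in dist.items() if rate > capacity)
    return lost / mean_rate


# Aggregate of two sources emitting 2 or 10 Mbit/s (probabilities 0.625 and 0.375).
dist = {4: 0.390625, 12: 0.46875, 20: 0.140625}
pc = probability_of_congestion(dist, 15)
clr = cell_loss_ratio(dist, 15)
```

In this small case the CLR is half the PC, which illustrates that the PC counts
congested states while the CLR also weighs how far each state exceeds C.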



4 BANDWIDTH ALLOCATION BASED ON THE PROBABILITY 
OF CONGESTION 

In this section, different methods to obtain Probability of Congestion (PC) on a link 
based on the convolution function are presented. The cost involved in the PC 
calculation is also analysed. 

4.1 The Formula-Based Convolution Approach 

This section contains the calculation of the bandwidth requirements of the 
superposition of several sources. This approach is based on the well-known 
expression of the convolution procedure denoted by: 

Q=Y*X (5) 

which is evaluated by the following expression: 

$$P(Y + X = b) = \sum_{k=0}^{b} P(Y = b-k)\,P(X = k) \qquad (6)$$

where Q is the bandwidth requirement of all established connections including the 
new connection; Y is the bandwidth requirement of the already established 
connections; X is the bandwidth requirement of a new connection, and b denotes 
the instantaneous required bandwidth. This function is expressed as the probability 
that all traffic sources together are emitting at a given rate b. We take into account 
that the evaluated offered load is not the link load itself, but the load generated by 
all the traffic sources intended to be carried by the link. The link carries this load in 
non-congestion state only. 

The direct application of expression (6) to evaluate the convolution is difficult in
practice. In order to obtain the probabilistic distribution on the link, a
vector containing all possible rate-probability pairs is defined; this vector is
called the System Status Vector (System-SV). To obtain the complete System-SV, the
following process is carried out: whenever a new connection demand arrives, the
System-SV must be updated; the corresponding Source-SV is used to do this
update, and for each old System-SV element a set of new System-SV elements is
generated. The rate of each new element is the sum of the existing rate and the rate
corresponding to the state of the new source. The probability of each new element
is the product of the existing probability and the new probability corresponding to
the state of the new source. By using this method, N-1 convolutions are needed for
each new connection. The expression (5) can be re-written as (Iversen, 1990):

$$Q_n = Q_{n-1} * X_n, \qquad n \in \{1, 2, \ldots, N-1\} \qquad (7)$$

Here $Q_0 = X_0$. Clearly, N-1 convolutions must be carried out to obtain the
global distribution. At any point in time, an ATM link can carry several thousand
connections. As described above, when a connection is accepted, the new connection
is convoluted with the global steady-state probabilities of all existing
connections. When a connection terminates, it would be preferable to deconvolve it
from the existing connections, so the feasibility of the deconvolution operation is
very important. The problem is that the global steady state has now changed, which
means that some previously calculated values are lost. The reason is that a)
probabilities corresponding to the same rate are accumulated and b) very small
intermediate values are not considered. Furthermore, if the state is not truncated,
the storage space required and the number of arithmetic operations both increase.
These aspects, relating to accuracy and cost, are developed further in the
following sections.
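The System-SV update and the storage issue discussed above can be sketched as
follows. The list-of-pairs representation, the function names and the `merge` flag
are assumptions of this example; they only illustrate the procedure and do not
reproduce any particular implementation.

```python
def update_system_sv(system_sv, source_sv, merge=True):
    """Combine every System-SV element with every state of the new source."""
    pairs = [
        (r_sys + r_src, p_sys * p_src)  # rates add, probabilities multiply
        for r_sys, p_sys in system_sv
        for r_src, p_src in source_sv
    ]
    if not merge:
        return pairs
    merged = {}
    for rate, prob in pairs:
        # Accumulate the probabilities that correspond to the same rate.
        merged[rate] = merged.get(rate, 0.0) + prob
    return sorted(merged.items())


def build_system_sv(source_svs, merge=True):
    """Fold N Source-SVs into one System-SV with N-1 convolutions (Q_0 = X_0)."""
    system_sv = source_svs[0]
    for source_sv in source_svs[1:]:
        system_sv = update_system_sv(system_sv, source_sv, merge)
    return system_sv


two_state = [(2, 0.625), (10, 0.375)]
unmerged = build_system_sv([two_state] * 10, merge=False)  # T**N elements
merged = build_system_sv([two_state] * 10, merge=True)     # one element per distinct rate
```

For ten identical two-state sources the unmerged vector holds 1024 elements while the
merged one holds 11. Merging saves storage, but it is also what prevents a departing
connection from being removed exactly, as noted above.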

Drawbacks 

In (Iversen, 1990 and 1991), (Kroner, 1990), (Kaltenmorgen, 1992), (R1022,
1990), (Del 122, 1991), (Miah, 1994) and (Ramalho, 1994) some limits of the
Convolution Approach are pointed out.

High cost in terms of storage requirements. Note that a huge amount of memory
storage M is required by the System-SV. This requirement increases dramatically
with the number of connections N and source states T.

High cost in terms of calculation. The computing time depends on the complexity of
the distribution itself. The time needed for the convolution increases with the
number of states per connection.

Individual QOS. The link status evaluated using a convolution approach makes no
distinction between individual connections. Thus, the individual QOS, in terms of
cell loss, for each type of source is not available.

4.2 The Enhanced Convolution Approach (ECA) 

To overcome the drawbacks associated with the PC calculation, a new method of 
evaluation is proposed: the Enhanced Convolution Approach (ECA). In this 
method, the multi-nomial distribution function is first applied to groups of the same 
type of sources, and the global state probabilities are finally evaluated by 
convolution of the partial results obtained from the different existing groups of 
sources. 

The state of the link can be expressed as a function of the number of active
connections of each service type $(n_0, n_1, \ldots, n_i, \ldots, n_{L-1})$. This is
because the state of the link depends only upon the occupancy of each service.
First the multinomial function is applied to homogeneous sources, producing
intermediate results. Finally, from these intermediate results a final result is
obtained by convoluting one element of a given class of traffic with one from each
of the other classes. This process is called multi-convolution in this study.
Fig. 4 Overview of the method.



The Multinomial Distribution Function (MDF) 



After computing the convolution the same rate may appear more than once in the 
System-SV. Which elements are repeated? How many times? This is expressed as 
a probability corresponding to the MDF, see (Ash, 1969) and (Hogg, 1989). 



$$P(n_0, n_1, \ldots, n_{T-1}) = \frac{N!}{n_0!\, n_1! \cdots n_{T-1}!}\; p_0^{n_0}\, p_1^{n_1} \cdots p_{T-1}^{n_{T-1}} \qquad (8)$$



The corresponding rate can easily be evaluated by the following expression:

$$r(n_0, n_1, \ldots, n_{T-1}) = \sum_{i=0}^{T-1} n_i\, r_i \qquad (9)$$

Note that the probability of each source being in state $s_i$ is independent of the
probability of the other source states.
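Expressions (8) and (9) can be evaluated for every occupancy vector of a group of N
identical sources, as in the sketch below. The enumeration routine and the row layout
(counts, rate, probability) are assumptions of this example, intended only to show
how one row per occupancy vector replaces the $T^N$ elements of a plain convolution.

```python
import math


def occupancy_vectors(n, t):
    """Yield every tuple (n_0, ..., n_{t-1}) of non-negative integers that sums to n."""
    if t == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in occupancy_vectors(n - first, t - 1):
            yield (first,) + rest


def mdf_probability(counts, state_probs):
    """Expression (8): multinomial probability of an occupancy vector."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)  # each step divides exactly
    prob = float(coeff)
    for c, p in zip(counts, state_probs):
        prob *= p ** c
    return prob


def state_rate(counts, state_rates):
    """Expression (9): aggregate rate of an occupancy vector."""
    return sum(c * r for c, r in zip(counts, state_rates))


def sub_matrix(n, state_rates, state_probs):
    """One row (counts, rate, probability) per occupancy vector of n identical sources."""
    return [
        (counts, state_rate(counts, state_rates), mdf_probability(counts, state_probs))
        for counts in occupancy_vectors(n, len(state_rates))
    ]


rows = sub_matrix(2, [2, 10], [0.625, 0.375])
```

For 50 two-state sources this produces 51 rows, one per number of sources in the
high-rate state.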

To evaluate the ECA, some data structures are necessary at this phase. For N 
connections of the same type, there is an associated Sub-Matrix (SMX). 
SMX_r is the generic row of SMX. The number of columns is equal to the number of
source rates T. SMX stores the distribution of the connections over the states. The
system load density function is obtained directly from the sub-matrix using the
MDF expression.

The multi-convolution procedure 

When there are different types of sources j (heterogeneous traffic), it is necessary
to ‘convolute’ between all source types. To store all possible combinations relating
to the system state, a System Status Matrix (SSM) is defined. The generic elements
of the SSM, namely the general system status rows SSM_r, are each generated by
concatenating all possible combinations of the rows SMX_r of the different
sub-matrices associated with the L different j-types of sources (j = 0, 1, ..., L-1).



m r =<M w ,„M rH , w > 


V r=0,..„ Mj., 


(11) 


and from (12) 






SSM r — < 




(12) 



Based on the ECA algorithm, grouping connections by type, the following
expression for the cell loss probability of the type-j traffic is proposed:

$$CLR_j = \frac{\sum_{W > C} \frac{W_j}{W}\,(W - C)\,P(Y = W)}{E(Y_j)} \qquad (13)$$

$W_j$ is the rate offered by all type-j traffic when the instantaneous offered rate
on the link is W, and $E(Y_j)$ is the mean rate of all traffic of type j. Both terms
are easily obtained during the evaluation of PC based on the ECA algorithm. This
gives the CLR for type-j traffic.
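The per-class cell loss ratio can be sketched as below. The sketch assumes that each
class is summarised by its own rate distribution (for instance obtained from its
sub-matrix) and that, in a congested state, the excess rate is shared among the
classes in proportion to their instantaneous rates; this proportional sharing is how
the expression is read here and should be treated as an assumption. The function name
is illustrative.

```python
from itertools import product


def per_class_clr(class_dists, capacity):
    """CLR of each class; class_dists is a list of {rate: probability}, one per class."""
    means = [sum(r * p for r, p in d.items()) for d in class_dists]
    lost = [0.0] * len(class_dists)
    # Multi-convolution: combine one state of each class with one of every other class.
    for combo in product(*(d.items() for d in class_dists)):
        total_rate = sum(rate for rate, _ in combo)
        if total_rate <= capacity:
            continue  # no congestion in this system state
        prob = 1.0
        for _, p in combo:
            prob *= p
        excess = (total_rate - capacity) * prob
        for j, (rate_j, _) in enumerate(combo):
            # Class j carries the share rate_j / total_rate of the lost traffic.
            lost[j] += rate_j / total_rate * excess
    return [l / m if m > 0 else 0.0 for l, m in zip(lost, means)]


clr_a, clr_b = per_class_clr([{0: 0.5, 10: 0.5}, {0: 0.5, 20: 0.5}], capacity=15)
```

Weighting each per-class value by the mean rate of its class and summing recovers the
total lost rate, so the per-class figures are consistent with the aggregate CLR.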

This method is widely detailed in (Fabregat, 1995) and (Marzo, 1993). 



5 CAC BASED ON THE PROBABILITY OF CONGESTION 
This section is focused on reasonable real-time processing, storage requirements as 
well as the maximisation of statistical multiplexing gain. Obviously, all the 
methods studied intend to guarantee QOS for all established connections. 

5.1 Implementation Issues 

In order to evaluate the PC, it is not necessary to evaluate the entire
distribution of the instantaneous rate. Using ECA it is possible to evaluate only a
part of the statistical distribution. The techniques presented next attempt to
evaluate only the most relevant part of the system state: congestion. All five
cut-off improvements can be implemented simultaneously.

Link Capacity Cut-off: the calculation of probability is only carried out in cases 
where the associated rate exceeds the bandwidth provided C. The associated 
probabilities with rates smaller than C are not calculated (i.e. the MDF is not 
evaluated). 

Probability of Congestion Cut-off. The PC of the system is compared with a 
previously set value in order to guarantee the specific QOS. Therefore, if during 
the process of calculation the current PC exceeds a pre-set value, the process is 
stopped and the calculation cost is thus reduced. 




[Figure: admissible PC area]



Partial Sorting Cut-off. Furthermore, in each System Status Matrix (SSM) the
rows generated are not examined in an arbitrary order, but are sorted according to
rate, so when the pre-set minimum rate C is reached, the process is terminated and
a further saving in computation time is achieved.

Small probabilities Cut-off. When the probability obtained by the Multinomial
Distribution Function is less than a pre-set threshold value, this result can be
ignored. Therefore, the corresponding element is not stored. This threshold value 
depends on the maximum congestion probability and on the number of 
connections. 

Grouping States Cut-off. For those classes with a large number of connections, the
majority of the information contained in the sub-matrix may be summarised by
grouping states. This mechanism could be applied independently to each class of
sources before evaluating the second phase in the calculation of PC.
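
The effect of two of these cut-offs can be illustrated on a plain rate-distribution convolution. The following is a minimal sketch and not the ECA itself: it assumes independent On-Off sources, each given by a hypothetical pair (peak rate, activity probability), and applies only the small-probabilities cut-off and the link-capacity test. The names convolve_sources, probability_of_congestion and the threshold eps are illustrative and do not appear in the paper.

```python
from collections import defaultdict


def convolve_sources(sources, eps=0.0):
    """Return {aggregate_rate: probability} for independent On-Off sources.

    sources: list of (peak_rate, p_on) pairs (hypothetical representation).
    eps: states whose probability falls below eps are dropped
         (small-probabilities cut-off); eps=0 keeps every state.
    """
    dist = {0.0: 1.0}  # with no source added, the rate is 0 with certainty
    for peak, p_on in sources:
        nxt = defaultdict(float)
        for rate, prob in dist.items():
            nxt[rate] += prob * (1.0 - p_on)   # this source is silent
            nxt[rate + peak] += prob * p_on    # this source emits at peak rate
        # Small-probabilities cut-off: negligible states are not stored.
        dist = {r: p for r, p in nxt.items() if p >= eps}
    return dist


def probability_of_congestion(dist, capacity):
    """Link-capacity test: only states with rate above C contribute to the PC."""
    return sum(p for rate, p in dist.items() if rate > capacity)


if __name__ == "__main__":
    srcs = [(10.0, 0.1)] * 20  # illustrative values, not taken from the paper
    exact = probability_of_congestion(convolve_sources(srcs), 50.0)
    pruned = probability_of_congestion(convolve_sources(srcs, eps=1e-12), 50.0)
    print(exact, pruned)
```

Dropping states discards probability mass, so the pruned result can only be less than or equal to the exact one. This is why the threshold has to be chosen with the target congestion probability and the number of connections in mind.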

5.2 Cost Experiments 

All experiments have been based on a General Modulated Deterministic Process 
(GMDP) source model. The GMDP model describes the behaviour of a traffic 
source at cell and burst level. The number of states for j-type sources is Sj. In each 
state i (with i = 0, 1, ..., Sj-1), during the corresponding sojourn time SojTj,i, cells
are sent with regular inter-arrival times (constant rate rj,i).

The evaluation cost is measured in terms of several metrics. Storage
requirements are measured in elements; each element has to store a rate and a
probability. The time parameter corresponds to CPU time and is expressed in
normalised time (seconds in the presented experiments). Sorting techniques are
required to put the partial status vector in order; the quick sort algorithm is used
when necessary, and its cost is expressed as x log(x), where x is the number of
elements to be arranged. Calculation cost is expressed as a normalised combination
of additions and products.

The computational efficiency has been measured for both the formula-based (basic)
convolution and the ECA. A comparison in time is shown in the ‘speed-up’ column
of the following table: the time obtained for the basic convolution is set to 1, and the
evaluations for the enhanced convolution approach are normalised to this value.
Two types of sources (60 and 75 connections) are multiplexed on a 600 Mbit/s
ATM link. In the presence of only one class of traffic the application of the ECA
increases the speed-up factor.

Table 2 Cost results for heterogeneous traffic

Sorting is the dominant factor for the formula-based convolution, whereas
calculation is the dominant factor for the enhanced convolution. Another conclusion
is the efficacy of the small probabilities cut-off. The first direct implication is the
reduction in the storage requirements. Moreover, this reduction in the intermediate
vectors makes the evaluation from 5 up to 500 times faster in the experiments carried out.

6 CAC EXPERIMENTS 

This Section discusses different aspects of the behaviour of cell streams in an 
output buffer corresponding to an ATM link and is illustrated by experimentation. 
CAC experiments relating to Fuzzy logic and (M+1)-MMDP approaches are 
presented. Measurements in an ATM test bed are also included. 

The experiments described in this section refer to a single ATM link and the QOS 
is expressed in terms of cell loss at the output buffer of an ATM switch. The traffic 
sources are VBR sources, modelled as On-Off sources described by the peak and 
mean bit rates and mean burst length. Experiments in homogeneous scenarios
reveal results similar to those for heterogeneous scenarios and have been omitted
from this paper.
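
The three descriptors of such an On-Off source determine its activity probability and its mean burst and silence durations. The following is a minimal sketch under stated assumptions: the class OnOffSource and its field names are hypothetical, the burst length is assumed to be given in cells, and the numerical values in the usage lines are illustrative and are not those of Table 3.

```python
from dataclasses import dataclass

CELL_BITS = 53 * 8  # an ATM cell is 53 bytes long


@dataclass(frozen=True)
class OnOffSource:
    """On-Off VBR source described by peak rate, mean rate and mean burst length."""
    peak_bps: float          # bit rate while the source is in the On state
    mean_bps: float          # long-term average bit rate
    mean_burst_cells: float  # average number of cells per burst (assumed unit)

    @property
    def p_on(self) -> float:
        # Fraction of time spent in the On state.
        return self.mean_bps / self.peak_bps

    @property
    def mean_on_s(self) -> float:
        # Time needed to emit an average burst at the peak rate.
        return self.mean_burst_cells * CELL_BITS / self.peak_bps

    @property
    def mean_off_s(self) -> float:
        # Follows from p_on = T_on / (T_on + T_off).
        return self.mean_on_s * (1.0 - self.p_on) / self.p_on


if __name__ == "__main__":
    src = OnOffSource(peak_bps=10e6, mean_bps=2e6, mean_burst_cells=100)
    print(src.p_on, src.mean_on_s, src.mean_off_s)
```

Only p_on and the peak rate enter a bufferless rate convolution; the burst and silence durations matter once buffering is modelled.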

RACE projects provide ATM test-beds on which measurements and tests can be 
performed (Kuhn, 1994). This set of experiments enables comparison of the 
average cell loss results obtained from on-line measurements in the Exploit ATM 
test-bed in Basel (R2061/28, 1994) with the cell loss predictions given by both the 
ECA and FCAC approaches for homogeneous and heterogeneous traffic scenarios 
on a single ATM link. Although FCAC attempts to predict the maximum cell loss
ratio per connection instead of the average cell loss ratio for the aggregate traffic, 
for the sake of comparison with the results obtained from on-line measurements, 
FCAC was trained to predict the average cell loss ratio instead. In Table 3 the 
traffic sources used for the comparison experiments are described. The link 
capacity considered is 155.52 Mbit/s and the output buffer size is 27 cells. 

Table 3 Characteristics of the traffic sources

Four A31 connections are mixed with 6 to 24 connections of B31 traffic in experiment
B5, and four A31 connections are mixed with 24 to 100 connections of C31 traffic in
experiment B6. In both experiments, the cell loss results obtained
by ECA and FCAC are very similar to those obtained by measurements in the
Exploit test-bed.

Considering that the buffer size used in the ATM test-bed experiments is small (27
cells), it is not surprising that the cell loss predicted by the ECA is so close to the
measured cell loss values. The buffer also explains the slight difference between the
measurements curve and the convolution curve: some of the generated cells can
be stored in the server output buffer, and therefore a more optimistic cell loss is
obtained (see measurements curve).




Fig. 6 (a) Experiments B5 and (b) experiments B6

The prediction results based on the convolution approach have been obtained using
the ECA algorithm without considering optimisations. The cut-off mechanisms and
other improvements intended to achieve a fast evaluation of the CLR can only be
used for CAC. In any case, for the experiments presented, the time required for the
evaluation was negligible.

7 CONCLUSIONS

In this paper the utilisation of the Probability of Congestion (PC) as a bandwidth 
allocation decision parameter has been presented. The validity of PC utilisation is 
compared with QOS parameters in small buffer environments when only the CLR 
parameter is relevant. 

To overcome the drawbacks of the formula-based Convolution Approach, a new 
method of evaluation is analysed: the Enhanced Convolution Approach (ECA). 
Sorting is the dominant cost factor for the formula-based convolution, whereas 
calculation is the dominant cost factor for the ECA. With reference to the cut-off 
mechanisms presented, the major conclusion is the efficacy of the low probability 
cut-off. This mechanism implies a reduction in the storage requirements and a
reduction of the evaluation time. The ECA also enables the computation of the 
Individual Cell Loss Ratio for each j-class of traffic. 

It can be summarised that the convolution algorithm seems to be a good solution 
for CAC in ATM networks with relatively small buffers. If the source
characteristics are known, the actual cell loss ratio can be accurately estimated.
Furthermore, this estimate is always conservative, allowing the provision of the 
network performance guarantees. We can also conclude that by combining the 
ECA method with cut-off mechanisms, utilisation of ECA in real-time CAC 
environments as a single level scheme is possible. 

Source modelling for more realistic traffic is now an open issue. The ECA 
utilisation does not take account of temporal references (burst length), so that the 
source parameterisation is simplified. On the other hand, more realistic traffic, such 
as VBR video sources, can be modelled as sources with more than two associated 
states. ECA can also be used with these new models.

A simple analysis of the ECA evaluation (see Fig. 4) suggests parallelisation of the
ECA algorithm as a further step (the sub-matrix corresponding to each
class of traffic can be evaluated in parallel).

Abstract

This paper introduces a rerouting algorithm for ATM networks - called the
Pro-Active Search Protocol - which enables efficient rerouting in a simple
way. To this end the protocol performs as many tasks as possible in
advance to reduce the complexity. Basically this comes down to working with shortest
path algorithms based on a known network topology. In order to increase the
rerouting performance this is combined with real-time actions after a rerouting
trigger, which collect accurate information about the network load. This is achieved
by means of a simplified distributed approach.

Keywords 

1 INTRODUCTION

In ATM Broadband Telecommunication Networks a wide range of services, each 
with particular demands and characteristics, will be offered to the user. An 
important issue in ATM Networks is the Quality of Service (QoS) guaranteed 
through negotiation contracts. The Quality of Service indicates the importance of
the service to the user, translated into guarantees on delays, limits on
cell losses, etc. A contribution to enabling the QoS guarantees is provided by
mechanisms in the network which allow for protection against undesirable events,
like network element failures or network area overloads, threatening the
continuation and quality of the service provisioning, thus degrading contracts and 
decreasing customer satisfaction. 

Intelligent routing aims at efficiently using the available network resources with 
respect to survivability, congestion control and possibly support of mobility. In all 
of these fields there is the requirement of finding alternative routes to replace 
interrupted connections, whether it is a network failure, a congested link or a user 
movement that caused the interruption. Considering the common point of finding 
alternative routes, it is more logical to refer to this as rerouting rather
than restoration, since the latter limits the application area solely to dealing with
network failures. Therefore the term rerouting will be used consistently in the rest
of the paper to emphasise the key issue in this context: to find in an efficient way an
alternative route to reroute an interrupted connection.

This paper introduces the Pro-active Search protocol, which tries to take 
advantage of static information present in the network and known in advance. At 
the same time the problem of acquiring accurate real-time information is tackled 
using some adjusted features of distributed restoration. The paper starts with some 
background on rerouting and some existing rerouting possibilities in Section 2. 
The new rerouting protocol is presented in Section 3, describing the general 
principles of the mechanism, followed by a description of the phases of the 
protocol. The protocol performance is discussed in Section 4 and Section 5 ends 
with a conclusion. 



2 REROUTING TECHNIQUES 

Looking at the survivability area in somewhat more detail, various
mechanisms that deal with network failures have been proposed (see below). Two of
their main characteristics, also applicable to the general rerouting problem,
inherently influence the performance of the protocol:
the moment when the alternative route is calculated or searched, and the control
under which the restoration process takes place.

With pre-planned restoration, the routes are calculated in advance taking 
assumptions into account on network topology, network traffic, failure scenarios 
etc, and they are stored in databases. Upon the occurrence of a failure the routes 
only have to be looked up in the database, resulting in a fast restoration technique. 
However, this only covers situations that have been anticipated in advance;
unexpected situations cannot be dealt with. Restoration in real time, on the other
hand, implies that the route is only searched for after the failure has occurred. This
results in a slower process, but one that is more accurate and more robust, since it
can react to network changes even at failure time.
An efficient balance between the two extremes could result in a robust and
time-efficient technique.

Considering the control of the rerouting process, there can be an overall central
control centre, coordinating every action of the process. The network nodes are
unable to operate without its commands, making the overall mechanism highly
vulnerable. With distributed control, every network node is provided with
intelligence to carry out the restoration. To this end the nodes exchange information
and take independent decisions to obtain an alternative route. There is no central
overview of the network state; the nodes search for a route by exchanging requests
for spare resources. A problem with this distributed control is the increased
complexity, mainly due to the high volume of messages flooding the network in
parallel. Another drawback is the inability to influence the process in any way and
its uncertain outcome. Again these are two extremes; an appropriate balance could
result in a more efficient mechanism.

Fully distributed rerouting mechanisms have already been investigated (Han 
Yang, 1988) (Komine, 1990) (Struyve, 1996) (Lievens, 1996), with satisfactory
performance results. However, the above-mentioned drawbacks of complexity and
unpredictability have led to research into other techniques.

The Backup Virtual Path protocol, described in (Kawamura, 1992), is such a 
mechanism. Basically this technique provides for every working Virtual Path (VP) 
a backup Virtual Path, which takes over the working traffic if the working VP is 
interrupted for whatever reason. The route of the backup VP is calculated in 
advance and the VPI/VCI translation tables of the nodes on that route are already 
filled in. In contrast with protection, the backup VP does not occupy any bandwidth
in normal working conditions. When a failure occurs in the working VP, the
backup VP is activated, bandwidth is captured and the backup VP takes over the
traffic of the interrupted working VP. Variations are possible, but in its basic form
the issue of unexpected events, especially on the routes of the backup VPs, remains:
if the backup VP is disturbed, the working VP cannot be restored.

Cooperation between backup VPs and a distributed rerouting protocol, which
copes with these unexpected events, has been proposed as well (Chen, 1997),
however at the cost of increased overall rerouting complexity.

Apart from these specific efforts in the restoration area, standardisation efforts
concerning routing, signalling and rerouting in and between private ATM networks
have been ongoing, resulting in the Private Network-Network Interface (PNNI) (ATM
Forum, 1996). In PNNI networks a hierarchical routing architecture is established
by recursively dividing the network nodes into Peer Groups with parent-child
relationships. All the nodes are provided with databases, describing (different)
parts of the network topology. The correct operation of PNNI is based on the
information on topology, routes, links etc. stored in these databases, which have to
be identical for nodes within one Peer Group. A lot of effort is put into ensuring that
the information in these databases is accurate and up to date, using flooding and
link state based protocols. Given these efforts, the idea arose to develop a protocol
which could take advantage of databases with pre-stored knowledge, while at the
same time trying to include accurate information without lengthy updates.



3 PRO-ACTIVE SEARCH PROTOCOL 

In developing the rerouting protocol the following goals have been taken into
account throughout the design process. A very important issue has been to end up
with a protocol that performs simple and efficient rerouting. It was therefore
important to consider what can be known, and thus done, in advance,
without degrading the performance because of inaccurate information.
For this reason, some real-time actions to collect the required rerouting information
have to be incorporated as well.

In the following sections, the general basic ideas are discussed first, followed by 
a more detailed formulation of the actual algorithm phases. 

3.1 Background and basic principles 

When considering rerouting in networks, many network aspects influence the 
process. Two of the most important and contributing issues are a network’s 
topology on one hand and the network usage, i.e. the traffic or network load, on 
the other hand. 

The topology of a network, which includes the nodes of the network and their 
interconnection by links, remains reasonably fixed and stable over a long period of 
time, certainly on the time scale important for rerouting of affected connections 
(i.e. seconds). The traffic in the network on the other hand, is flexible and subject 
to more frequent changes, as users will regularly start up new connections while 
other connections are torn down. 

Since the topology is unlikely to change regularly on a short time basis (seconds 
to minutes), the topology can be regarded as a quasi-static given in the rerouting 
process. Therefore the rerouting protocol assumes that the network topology is
known by the network nodes. This knowledge will be used to calculate candidate 
alternative routes in advance. Major changes in the topology, like the installation 
of new nodes or the upgrading of links, are distributed as updates to the nodes. 
This keeps them informed of the general ‘static’ network topology. Distributing 
topology information is a well-known issue and a common aspect in today’s 
networks like for example the Internet; also in ATM PNNI routing the spreading of 
topology information is very important. It is emphasised that these updates only 
occur in an orchestrated way at regular intervals, after a major change in topology. 
More detailed knowledge on how these updates are performed is irrelevant with 
respect to the further flow of the protocol. It is noted here that an unexpected event 
like a network element failure is not considered as a major topological change. 
From this point on the overall working network topology is assumed to be known. 

The traffic in the network, on the contrary, is likely to change at much
shorter time intervals than the topology. This makes the network load a real-time
and more uncertain aspect, and it is thus considered a dynamic issue. Therefore it
is assumed that the network nodes are not aware of the current overall network
traffic.




Figure 1 Network state correlation in time. 

A short intuitive note on the accuracy of network information, seen from the 
viewpoint of the elapsed time between getting the information and actually using 
it, can be found in Figure 1, which shows the qualitative correlation between 
knowledge of the current network state and knowledge of the network state at 
some point in the future. The network state for the rerouting purposes described
here, is divided into topology state and load state information. Given the highly 
static character of the topology, the topology state correlation degrades only 
slowly: knowledge of the current topology will most likely imply knowledge of the 
future topology. The network traffic is dynamic, resulting in a rapidly decreasing 
correlation. Knowing the network traffic now, does not imply that the traffic is 
known at a moment in the future, when this information might be needed for 
rerouting. This implies that calculations on topology can be carried out in advance, 
while accurate information on traffic must be collected in real-time. 

When the network is confronted with unexpected changes in topology, caused 
for example by failures of network elements, this can interrupt working 
connections and it is important to restore these connections as fast as possible, in 
this case by finding alternative routes on which the connections can be provided 
again. That is dealt with by the real-time aspect of the rerouting protocol. It takes
advantage of the knowledge of the network topology by using pre-calculated
routes. Network traffic information, i.e. the bandwidth still available on links, is
acquired in real time along these routes.

3.2 Phases in the algorithm



Pre-rerouting phase: calculation of the routes 

The purpose of this pre-rerouting phase is to provide each network node with a 
number of paths between itself and every other network node, thus taking advantage
of the knowledge of the network’s topology. These paths are to be used in the next 
phase, where the actual rerouting will take place upon the occurrence of a 
rerouting event. 




Figure 2 Pre-calculated paths. 



A number of paths between every pair of nodes has to be calculated, which are 
stored in a database in the node (Figure 2). The paths can be determined because 
the topology of the network is assumed to be known to every network node. For 
the computation of the paths, an algorithm is required which is able to determine a 
number of paths between a pair of nodes, based on the topology of the network in 
terms of nodes and links. This algorithm can be a static and central one. 

It is preferable that the routes are efficient in terms of length, cost, etc. Therefore
the Pro-Active Search Protocol uses an algorithm that calculates the k shortest 
paths between every node pair in a network. The qualification shortest is 
important, since alternative routes should be efficient in terms of spare resources 
and since these paths will eventually be used in a time sensitive environment: the 
sooner an alternative route is found, the less information is lost. 

Algorithms for finding k shortest paths between two nodes in a network can be 
found in literature. The algorithm used here is the algorithm of Yen, which finds k 
shortest loop-less paths in a network. The full description can be found in (Yen, 
1971). It iteratively applies an arbitrary shortest path algorithm, in this case
Dijkstra’s. Furthermore Yen’s algorithm avoids loops in the paths, i.e. it
prevents nodes from being present in a path more than once. This feature is important
for the efficiency of the rerouting. As far as computation time is concerned, Yen’s
algorithm has an upper bound which changes only linearly with the number k.

It is however important to note that basically any algorithm can be used here, 
which is able to calculate k shortest paths from a network topology in which the 
nodes, the links and their costs are given. A short overview is given in (Shier, 
1997). 
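
The pre-rerouting computation can be sketched as follows. This is a minimal illustration of Yen's k shortest loop-less paths on top of Dijkstra's algorithm, not the implementation used by the authors: the graph representation (a dictionary mapping each node to its neighbours and link costs) and the function names dijkstra, path_cost and k_shortest_paths are assumptions made for the example.

```python
import heapq


def dijkstra(graph, src, dst, banned_nodes=frozenset(), banned_edges=frozenset()):
    """Shortest path src -> dst as (cost, [nodes]), or None if unreachable.

    graph: {node: {neighbour: cost}} (hypothetical representation).
    """
    heap = [(0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nbr, weight in graph.get(node, {}).items():
            if nbr in done or nbr in banned_nodes or (node, nbr) in banned_edges:
                continue
            heapq.heappush(heap, (cost + weight, nbr, path + [nbr]))
    return None


def path_cost(graph, path):
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def k_shortest_paths(graph, src, dst, k):
    """Up to k loop-less paths from src to dst, cheapest first (Yen's scheme)."""
    first = dijkstra(graph, src, dst)
    if first is None:
        return []
    accepted = [first[1]]
    candidates = []  # heap of (cost, path) not yet accepted
    while len(accepted) < k:
        last = accepted[-1]
        for i in range(len(last) - 1):
            spur, root = last[i], last[:i + 1]
            # Forbid the links that accepted paths already take out of this root.
            banned_edges = {(p[i], p[i + 1]) for p in accepted
                            if len(p) > i + 1 and p[:i + 1] == root}
            # Forbid the root nodes so that the resulting path stays loop-less.
            banned_nodes = frozenset(root[:-1])
            found = dijkstra(graph, spur, dst, banned_nodes, banned_edges)
            if found is None:
                continue
            total = root[:-1] + found[1]
            entry = (path_cost(graph, total), total)
            if entry not in candidates:
                heapq.heappush(candidates, entry)
        if not candidates:
            break  # fewer than k loop-less paths exist
        accepted.append(heapq.heappop(candidates)[1])
    return accepted


if __name__ == "__main__":
    g = {"C": {"D": 3, "E": 2}, "D": {"F": 4}, "E": {"D": 1, "F": 2, "G": 3},
         "F": {"G": 2, "H": 1}, "G": {"H": 2}, "H": {}}
    for p in k_shortest_paths(g, "C", "H", 3):
        print(path_cost(g, p), p)
```

In a node database, the result of k_shortest_paths would be stored per destination, so that a lookup at rerouting time involves no path computation.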

In moderately sized networks it is feasible to calculate k paths between every pair
of nodes in the network, in other words to have every node store k paths to
every other network node. When networks grow larger in size, this might become
impractical. An option is then to limit the pairs in geographical distance and only
calculate paths between any pair of nodes not too far away from each other.
Another solution can be found in hierarchical networks, like for example in PNNI
networks [7]. There the paths can be confined within a peer group, limited in size
anyway, and calculated between any pair of peer nodes.

It is not considered efficient to store all the paths of the entire network or 
network group in some central location, which is consulted after a failure and 
which downloads the required paths. This is considered to be too vulnerable and 
time consuming. Instead every node has its own database, in which paths from
this node to all or some other nodes are stored. When required, the paths are 
immediately available. 

The paths have to be computed prior to a rerouting event, explaining the term 
pro-active. Whenever a major topology update is issued, the routes must be 
recalculated, to ensure the effectiveness of the paths. This recalculation - using the 
same k shortest path algorithm as before, but now on the updated topology - is a 
background process, running in parallel or more accurately in between rerouting 
processes. Possible inconsistencies of databases between nodes are here not as 
important or threatening as for example in PNNI. Because a node knows the entire 
path, it need not rely on the next intermediate node to determine the appropriate next
hop in the path. The network nodes don’t need identical databases to ensure a 
correct result. Moreover, since a number of paths are pre-calculated, the impact of 
one invalid route is low, at most only slightly decreasing the performance. 

Rerouting phase 

This is the part of the rerouting protocol that executes real-time actions in order to 
collect real-time link information. This will result in obtaining valid alternative 
routes, which will actually replace the failed connections. The required real-time 
information on the load of the network is collected by sending out search messages 
- so-called Courier messages - in a controlled way. 

This real-time rerouting part is triggered by a so-called rerouting trigger event. 
This is an unexpected network event, like the failure of a network element, 
interrupting working connections. A network link becoming overloaded, with a
risk of congestion, can also act as a trigger to search for alternative routes avoiding
this crowded link.

As far as the rerouting is concerned, there is the choice to reroute on link or on 
path basis (Figure 3). With link rerouting, only the unavailable link will be 
replaced by an alternative route, the undamaged parts of the connections using the 
failed link are kept. In case of path rerouting, the entire connection(s) using the 
failed link will be rerouted. The latter introduces some extra complexity because the
connection end nodes must be notified of the rerouting and the capacity along that
connection on working links must be released. This causes some extra delay as
compared with link rerouting, which is a local process and generally faster.
However, the total alternative route in link rerouting might be less efficient.








Figure 3 Link and Path rerouting. 



The nodes between which the alternative route is searched - whether these are the
link-adjacent nodes or the connection end nodes - are called the restoration node
pair. They start and eventually terminate the real-time process. In this
protocol, the Sender-Chooser technique (Grover, 1987) is used, in which one of 
the restoration pair nodes starts sending out Rerouting Messages, while the other 
one will eventually choose the alternative routes found during the rerouting 
process. They are called Sender and Chooser node (Figure 4). The arbitration of 
Sender and Chooser is mostly based on node IDs. It should be noted that in case of 
path rerouting different Sender-Chooser pairs will be active in the network, one for 
each failed connection. This can complicate the process. 




Figure 4 Sender-Chooser arbitration. 

Actions of Sender and Chooser nodes 

The Sender node consults its database for the k pre-calculated shortest paths to 
the Chooser node. On each of these k paths, the Sender node sends one Courier 
Message to the first node on these routes, which represent the possible alternative 
routes (Figure 5). Every Courier Message collects and stores real-time information 
about that route on its way to the Chooser node. Basically this deals with the 
available bandwidth of the links on the route. 

The Chooser node will initially just wait, until Courier Messages arrive from any 
of the routes. 

Figure 5 Actions of Sender and Chooser.



Controlled flooding: forwarding of Courier Messages 



The Courier Messages are forwarded from node to node along their specific 
route, meanwhile collecting appropriate real-time link information (Figure 6). 




Figure 6 Controlled vs. full flooding.



This phase has characteristics of a distributed restoration algorithm: the network 
nodes perform the required rerouting actions without commands from some central 
control unit. Some of the major drawbacks of fully distributed algorithms however
- being the uncontrolled flooding of restoration messages, the resulting complexity
and the unpredictable outcome of the restoration process - are avoided by the
controlled forwarding of Courier Messages. The messages are not flooded to all
neighbours, which could create an avalanche of parallel messages roaming in the
network, but they are only sent on specific and fully specified paths. The extent of
the network area visited by the messages is also controlled: only the links
of the pre-calculated routes will carry rerouting messages. This implies that the
network operator has more control on the alternative routes to be found, while still 
having the advantage of self-healing facilities in the nodes. 

The rerouting messages do not reserve bandwidth along the route. They only 
collect the real-time load information and bring this information to the Chooser 
node. This strategy avoids complex release protocols at the end of the process, 
required to release bandwidth previously reserved for a particular failure but which 
is no longer needed because, for example, other routes have been found. 

A Courier Message will only be forwarded on the next link of the route if there is 
still some available bandwidth on that link. This implies that a Courier Message 
encountering a link without spare capacity is not forwarded (Figure 7). As far as
that particular route is concerned, the rerouting process stops: this is not a viable
alternative route. No cancel or release messages are required. 
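
The forwarding rule can be sketched as follows. This is a minimal illustration under stated assumptions: the spare-capacity table is a single hypothetical dictionary keyed by directed link, whereas in the protocol each node would only consult its own outgoing links, and the names forward_courier and send_couriers are illustrative.

```python
def forward_courier(path, spare):
    """Walk one Courier Message along a pre-calculated path.

    path:  list of nodes from Sender to Chooser.
    spare: {(a, b): spare bandwidth on link a -> b} (hypothetical global view).
    Returns the collected [(link, spare)] report, or None if the message is
    dropped at a link without spare capacity.
    """
    report = []
    for link in zip(path, path[1:]):
        available = spare.get(link, 0)
        if available <= 0:
            return None  # silently dropped: nothing was reserved, nothing to release
        report.append((link, available))  # information only, no reservation
    return report


def send_couriers(paths, spare):
    """Sender side: one Courier Message per pre-calculated path."""
    arrived = []
    for path in paths:
        report = forward_courier(path, spare)
        if report is not None:
            arrived.append(report)
    return arrived


if __name__ == "__main__":
    spare = {("S", "A"): 10, ("A", "C"): 5, ("S", "B"): 0, ("B", "C"): 8}
    print(send_couriers([["S", "A", "C"], ["S", "B", "C"]], spare))
```

Because a dropped message leaves no state behind in the nodes, no cancel or release message is needed for the abandoned route.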




Figure 7 Pre-calculated path unavailable. 

Actions of the Chooser node 

Assume that a Courier Message arrives in the Chooser node. With the collected 
information about the links’ load in the message, the Chooser node can build up an 
overview of used and available capacity on the routes between Sender and 
Chooser. On that basis the Chooser can observe whether a route has available
capacity and decide to take it as an actual alternative route. The Chooser
then adjusts its capacity allocation overview to indicate that some of the
available capacity is taken by an alternative path. This is mainly important when
the k routes have some links in common, so that conflicts in the allocation
of the available capacity can occur. The choice of an actual path by the Chooser is
made on a first-come first-take basis. It is also possible to wait for a certain time
period and then to choose, among the candidates that have arrived, the route with
the largest sufficient bandwidth.
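
The first-come first-take choice, together with the capacity allocation overview, can be sketched as follows. This is an illustration under assumptions that are not stated in the paper: each Courier report is a list of (link, spare bandwidth) pairs, the failed capacity may be split over several routes, and the function name choose_routes is hypothetical.

```python
def choose_routes(arrivals, demand):
    """First-come first-take selection at the Chooser node.

    arrivals: Courier reports in order of arrival; each report is a list of
              (link, spare_bandwidth) pairs for one candidate route.
    demand:   failed capacity that has to be rerouted.
    Returns (allocations, unserved), where allocations is a list of
    (route_links, bandwidth_taken).
    """
    overview = {}      # Chooser's view of spare capacity per link
    allocations = []
    remaining = demand
    for report in arrivals:
        if remaining <= 0:
            break
        for link, spare in report:
            # Keep the already adjusted value for links shared with earlier routes.
            overview.setdefault(link, spare)
        usable = min(overview[link] for link, _ in report)
        take = min(usable, remaining)
        if take <= 0:
            continue   # a shared link has been used up by an earlier choice
        for link, _ in report:
            overview[link] -= take
        allocations.append(([link for link, _ in report], take))
        remaining -= take
    return allocations, remaining


if __name__ == "__main__":
    arrivals = [
        [(("S", "A"), 4), (("A", "C"), 6)],
        [(("S", "A"), 4), (("A", "D"), 9), (("D", "C"), 9)],  # shares link S-A
        [(("S", "B"), 5), (("B", "C"), 5)],
    ]
    print(choose_routes(arrivals, demand=7))
```

In the example the second route is skipped because its shared link S-A was exhausted by the first choice, which is exactly the conflict the allocation overview is meant to detect.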

Route confirmation 

When a route is chosen, the Chooser node sends back a Confirmation Message 
on that route towards the Sender node. This message will ensure that the nodes on 
the route will adjust their routing tables and actually reserve the capacity for that 
route. In that way, the message is rippled back to the Sender node, which then 
knows that a valid alternative route is found. 

It is possible that on the way from Chooser to Sender, a link is encountered 
where the capacity assumed to be free is no longer available because, for example,
another rerouting process has taken that bandwidth. In that case a Cancel
message is sent back to the Chooser node and, if available, another route is taken
instead.

The rerouting process is terminated when all failed capacity can be rerouted 
along alternative route(s) or when there is no more available bandwidth on the k 
routes. In the latter case some of the failed capacity cannot be rerouted. 
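
The confirmation step can be sketched as a hop-by-hop reservation from Chooser to Sender. This is a minimal illustration: the function name confirm_route and the shared spare-capacity dictionary are hypothetical, and the release of partial reservations while the Cancel message travels back to the Chooser is an assumption, since the paper does not describe that detail.

```python
def confirm_route(route_links, bandwidth, spare):
    """Send a Confirmation Message from Chooser back to Sender.

    route_links: links of the chosen route, ordered from Sender to Chooser.
    bandwidth:   capacity to reserve on every link of the route.
    spare:       {link: spare bandwidth}; updated in place on success.
    Returns True if the route is confirmed, False if a Cancel is triggered.
    """
    reserved = []
    for link in reversed(route_links):      # Chooser side first
        if spare.get(link, 0) < bandwidth:
            # Capacity was taken in the meantime: Cancel goes back to the Chooser.
            # Assumption: reservations made so far are released on the way back.
            for done in reserved:
                spare[done] += bandwidth
            return False
        spare[link] -= bandwidth            # actual reservation on this hop
        reserved.append(link)
    return True


if __name__ == "__main__":
    spare = {("S", "A"): 5, ("A", "C"): 5}
    print(confirm_route([("S", "A"), ("A", "C")], 3, spare), spare)
    print(confirm_route([("S", "A"), ("A", "C")], 3, spare), spare)
```

After a False result the Chooser would try the next candidate route, if one is available.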

Remark on the k shortest paths

As can be derived from the previous sections, the nature of the pre-calculated paths
contributes strongly to the chances of success of the protocol. At the end of the
rerouting phase, some of these paths will be used as the actual alternative route(s)
for the affected connection. These paths therefore directly affect the
final rerouting performance.




[Figure panels: fully disjoint paths; paths with shared links]

Figure 8 Nature of the pre-calculated paths. 

The number of pre-calculated paths is an important parameter in the rerouting 
process, influencing the performance and the rerouting capabilities of the protocol. 
The number k must be large enough to have reasonable chances that a feasible 
route can be found eventually. 

The paths must also be efficient in terms of the used links. This is concerned 
with how the k calculated paths between a node pair are related to each other, with 
respect to using the same links and nodes (Figure 8). With Yen’s algorithm it is 
highly probable that the k shortest paths between two nodes share a number of 
links and nodes. If such a shared link has little available capacity, this means that 
immediately a number of the k paths become unavailable as alternative routes, 
degrading the rerouting chances. The number k does not directly refer to the 
number of disjoint paths between a node pair. However it is important to realise 
that fully disjoint paths are not required for this protocol, simply a number of pre- 
calculated routes covering a certain area which can be searched in a controlled way 
at real-time. Besides, the connectivity degree of the network limits the maximum 
number of disjoint paths due to the limited number of outgoing links out of a node. 

A simple extension has been added to the Yen algorithm to prevent, in an 
extreme case, a link from being used by all of the k pre-calculated paths. Basically a 
threshold is used to set the maximum number of paths that can use the same link. 
This increases the rate of success, since this one link is a possible bottleneck to the 
rerouting process. The protocol leaves ample space for other algorithms that can 
calculate k paths between two nodes. 
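
The effect of such a threshold can be illustrated with a minimal sketch. It is not the extension of Yen's algorithm itself: it assumes that a list of candidate paths, sorted shortest first, is already available and simply filters that list. The function names and the representation of a path as a list of node names are hypothetical, and links are treated as undirected.

```python
from collections import Counter
from typing import FrozenSet, Hashable, List

Path = List[Hashable]


def links_of(path: Path) -> List[FrozenSet[Hashable]]:
    """Undirected links traversed by a path given as a node sequence."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def select_paths(candidates: List[Path], k: int, threshold: int) -> List[Path]:
    """Keep at most k candidates such that no link is used by more than
    `threshold` of the selected paths. Candidates are assumed to be
    sorted shortest first, so shorter paths are preferred."""
    usage: Counter = Counter()
    chosen: List[Path] = []
    for path in candidates:
        links = links_of(path)
        if any(usage[link] >= threshold for link in links):
            continue  # this path would overload a shared link
        chosen.append(path)
        usage.update(links)
        if len(chosen) == k:
            break
    return chosen
```

With a threshold equal to k the filter has no effect; lowering it forces the selected paths to spread over more links, at the price of possibly longer routes.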

The use of pre-calculated paths also allows for a more particular choice of the 
routes. Instead of just calculating the k shortest paths, one could provide the 
databases with paths that cover only a certain network area, recommended for 
rerouting. This would allow network operators to enforce a certain rerouting 
strategy. 

4 PROTOCOL PERFORMANCE 

To verify the behaviour of the protocol and to assess its performance 
in terms of rerouting capacity, the protocol was specified in the Specification 
and Description Language (SDL; Z.100, 1988), the standard language for protocol 
description, using the SDL Design Tool (SDT), which provides the means to 
describe and simulate communication protocols. 

Each node is assigned a model of a FIFO queue and one processor. It takes 10 
ms to read in and process an incoming message and 10 ms to generate an outgoing 
message. The simulations were performed on a realistic 32 node network, provided 
with enough spare capacity to enable the rerouting of all single link failures. 




Figure 9 Single link failures (k, threshold), link rerouting (horizontal axis: time in s). 

The average rerouting degree for link rerouting of all single link failures in the 
sample network is shown in Figure 9 as a function of the elapsed rerouting time. When k is 
small, there is usually not enough total spare capacity to reroute all failed 
connections. The combination (20,10) showed the best degree. A comparison 
with a distributed technique simulated with the same parameters (Lievens, 1996) 
shows that the in-advance actions of this protocol do indeed give good results. 

The same network was simulated for single link failures applying path rerouting. 
The results obtained are significantly worse than with link rerouting. This is due to 
the fact that with path rerouting, different simultaneous search processes are 
ongoing in the network, one for each connection that used the failed link. These 
simultaneous processes compete with each other for spare capacity, thus 
explaining the performance degradation. 

Figure 10 Double link failures; node failures. 



Simulations were performed for node failures and double link failures (Figure 
10), on the same network with fixed spare capacity, resulting in satisfactory 
performance. For the node failures, path rerouting was used, explaining the longer 
time needed to achieve rerouting. 



5 CONCLUSION 

In this paper a new rerouting protocol is introduced for ATM networks. The 
Pro-Active Search protocol combines real-time action with path calculations carried 
out in advance. This is based on the assumption that a network’s topology is quasi- 
static. Therefore this topology is assumed to be known by the nodes, and they take 
advantage of this by calculating shortest paths between them and every other 
network node. The result is that each node knows k paths to another node, as far as 
the general topology is concerned. The network’s traffic is assumed to be dynamic 
and flexible; this is considered as unknown by the nodes. When rerouting is 
required, the nodes will gather accurate, real-time information of the load on the 
pre-calculated paths by sending a limited number of Courier messages on each of 
the k paths. In this way speed and accuracy are combined, resulting in a fairly 
simple protocol with the advantages of pre-planned restoration together with those 
of real-time distributed restoration. 

Abstract 

For the implementation of multipoint-to-point connections in ATM, various 
approaches exist, each with its own advantages and disadvantages. VP-based 
methods require unique sender identification but they do not require reassem- 
bly in merging points. In contrast, VC-based methods do not require unique 
sender identification but they do require reassembly in merging points. It is 
likely that VC merging will be the method of choice as it is scalable and yet 
relatively simple to implement. One of its drawbacks is the increased out- 
put buffer space required at the switches because of packet reassembly at the 
merging points. This paper investigates the impact of the switch architecture 
and characteristics on the output buffer space by means of simulation. The re- 
sults obtained demonstrate that for typical switch architectures, VC merging 
does not require significant additional buffering compared to VP merging. 

Keywords 

IP over ATM, VC merging, VP merging 



1 INTRODUCTION 

In current ATM networks, there exist only point-to-point (pt-to-pt) and point- 
to-multipoint (pt-to-mpt) connections. For the interconnection of routers across 
an ATM network as well as for many other information-gathering applications, 
multipoint-to-point (mpt-to-pt) connections appear to be more appropriate. 
Interconnection of N routers requires order N^2 labels for the order of N^2 
pt-to-pt connections as described by Calvinac et al. (1997). With mpt-to-pt 
connections only N labels for the N associated connections are necessary. 
This significantly reduces the required label space and thus makes the method 
more scalable. The same new ATM connection type could also be used in the 
context of merged connections for MPLS (Callon et al. 1997). 
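
The label counts above can be made concrete with a short sketch. It assumes one unidirectional pt-to-pt connection per ordered pair of routers, which gives N(N-1) connections and is therefore of order N^2, and one mpt-to-pt connection per destination router. The function names are illustrative only.

```python
def pt_to_pt_labels(n_routers: int) -> int:
    """Full mesh: one unidirectional connection per ordered router pair."""
    return n_routers * (n_routers - 1)


def mpt_to_pt_labels(n_routers: int) -> int:
    """Merged case: one mpt-to-pt connection per destination router."""
    return n_routers


if __name__ == "__main__":
    for n in (16, 64, 128):
        print(n, pt_to_pt_labels(n), mpt_to_pt_labels(n))
```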

To implement mpt-to-pt connections, different solutions are possible. We 
focus on the two most important methods: VP merging and VC merging. 

VP merging: Each sender is assigned a globally unique identifier having the 
format of a VCI. The identifier is carried in the VCI field of the ATM cell. The 
ATM switch translates incoming VPIs for the same destination to the same 
outgoing VPI. The receiver distinguishes amongst the different sources by the 
different VCIs. The key advantage of this scheme is that no VCC resources 
are required in the switching nodes as only VP switching is performed. This 
implies no change of hardware but only a change of the connection estab- 
lishment protocol. Some of the disadvantages of VP merging are the lack of 
scalability caused by the VPI address space limitation of 4096 entries and the 
need for a “global VCI uniqueness” protocol. There are proposals to circum- 
vent the nonscalability by enlarging the VPI address space at the expense of 
VCI address space. This is not desirable, however, because it requires changes 
in the switching hardware. 

VC merging: This method avoids the requirement 
for globally unique sender identifiers, and it consumes only one VCI per traversed 
link. These characteristics make this approach scalable. Each source 
participating in a mpt-to-pt connection has a unique VCI per link. The ATM 
switch translates incoming VCIs belonging to the same connection to a single 
outgoing VCI. This means that cells of packets belonging to different senders 
could be interleaved. As the receiver is not able to distinguish cells from dif- 
ferent senders, packet reassembly has to be performed at the merging points, 
and all cells from a given packet must be sent contiguously so that reassem- 
bly at subsequent merging points and at the receiver will be possible. AAL 
3/4 would solve the problem by introducing the Message Identifier (MID) 
field for sender identification in every cell. The use of AAL 3/4, however, has 
other drawbacks such as the limited space of the MID field, the inefficient 
encapsulation method, and the less powerful CRC capability. In this paper 
we consider the employment of AAL 5 because it is widely available and sup- 
ported in ATM switches, especially in data networks. Packet reassembly at 
the merging points introduces additional buffer requirements on the switching 
architecture because all of the cells of a packet sent by a sender belonging to 
a mpt-to-pt connection have to be stored and must wait for the last cell of 
the packet identified by the “End Of Packet” (EOP) marker used by AAL 
5 to arrive. Figure 1 depicts the cell interleaving problem. Packet reassem- 
bly also introduces additional delay for packets transported over a merged 
connection and adds burstiness to the traffic. This is because all the cells of 
a packet have to wait at every merging point. They appear afterwards as a 
burst of a whole packet at the output link. This burstiness becomes even worse 
as it is often cascaded and thus accumulated over numerous merging points. 
Heinanen (1997) gives some hints about how to solve the problems involved 
in mpt-to-pt VC merging. 

Figure 1 Cell interleaving problem. 

A third possibility is to handle a mpt-to-pt connection of N senders to 
one receiver like N pt-to-pt connections without applying any merging. This 
possibility again requires order N^2 labels for the order of N^2 pt-to-pt connections. 
Of the above possible solutions, VC merging appears to be the method 
of choice as it is relatively easy to implement and yet scalable. At the ATM 
Forum, VC merging has been almost fully accepted and will most likely be in- 
troduced in the PNNI v2.0 specification (expected to be finished in the spring 
of 1998). The only concern is with the reassembly required in the switches 
in terms of additional buffering and delay. The numerous simulations pre- 
sented in the following sections are used to investigate the required additional 
buffer overhead for VC merging. It is also very likely that different methods of 
merging and nonmerging will exist simultaneously in an ATM network. Some 
interworking aspects of these methods are discussed by Widjaja et al. (1997). 

Section 2 of this paper describes our switching architecture model for VC 
merging and the model of the arriving traffic. In Section 3 we show our sim- 
ulation setup and discuss the results of the simulations. In Section 4 we give 
a summary and derive some conclusions. 



2 SWITCH AND TRAFFIC MODEL 



2.1 Switch Model 

In this paper we consider the general class of single-stage, nonblocking MxM 
packet switches with both input and output queuing (Iliadis and Denzel 1993, 
Denzel et al. 1995). The shared output buffer is assumed to be sufficiently large 
so that the switch performance is close to optimal, corresponding to the pure 
output queuing. Cells are transferred from the head of the input queues to the 
shared buffer. The speed of the input and output switch ports is denoted Rs, 
and the speed of the outgoing links is denoted Rl. Let k denote the speed ratio 
of the switch speed (per port) to the outgoing link speed, i.e. k = Rs/Rl. 
Typically k is greater than one, which implies that an output queue should 
be provided in order to cope with the speed mismatch. 

Figure 2 Switch architecture (reassembly and output queue). 
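
Why a speed ratio k greater than one calls for an output queue can be seen from a simple fluid approximation, which is an assumption of this sketch and not the simulation model used in the paper. While a burst of E cells arrives at speed Rs, the link drains E/k cells, so an initially empty queue holds about E(1 - 1/k) cells when the burst has fully arrived.

```python
def backlog_after_burst(packet_cells: int, k: float) -> float:
    """Cells left in an initially empty output queue once a burst of
    packet_cells, delivered at k times the link speed, has arrived.

    Fluid approximation: arrival takes packet_cells / Rs, during which
    the link transmits packet_cells / k cells.
    """
    if k < 1:
        raise ValueError("the speed ratio k = Rs/Rl is assumed to be >= 1")
    return packet_cells * (1.0 - 1.0 / k)
```

For k = 1 the backlog is zero, and for large k almost the whole packet has to be buffered.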

As described above, VC merging requires an amount of additional output 
buffering due to the packet reassembly. We introduce a so-called reassembly 
buffer at each output port of the ATM switch. Figure 2 shows the concept 
of the reassembly buffer. A switch has M input ports and N sources of mpt- 
to-pt connections because it is likely that different connections will coexist. 
Hence N can be much larger than M . The model considered in this paper 
is valid for the case of N < M . The case of N > M is not covered by the 
present switch model and is therefore a subject for further investigation. At 
every merging point, each of the sources participating in the corresponding 
mpt-to-pt connection is associated with a distinct reassembly buffer at the 
output queue. When the last cell of a packet with the EOP marker arrives 
at the reassembly buffer, all of the cells of a packet are instantly transferred 
into a single output buffer per output port. Physically the reassembly and the 
output buffers of one output port share a common memory pool. The transfer 
from the reassembly to the output buffer can easily be done by a pointer 
movement and will therefore not incur additional delay. 
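
The behaviour of the reassembly buffers at one output port can be sketched as follows. The class and field names are hypothetical. For simplicity the sketch copies cells from the reassembly buffer to the output buffer, whereas the text describes a pointer movement within a common memory pool.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

# (VCI, payload, end-of-packet flag); a simplified stand-in for an ATM cell
Cell = Tuple[int, str, bool]


class MergePoint:
    """One output port performing VC merging with per-source reassembly."""

    def __init__(self, out_vci: int) -> None:
        self.out_vci = out_vci
        # One reassembly buffer per incoming VCI (i.e. per source).
        self.reassembly: Dict[int, List[Cell]] = defaultdict(list)
        # Single output buffer of the port.
        self.output: Deque[Cell] = deque()

    def receive(self, cell: Cell) -> None:
        in_vci, _, eop = cell
        self.reassembly[in_vci].append(cell)
        if eop:
            # Packet complete: release all of its cells contiguously,
            # relabelled with the single outgoing VCI.
            for _, payload, flag in self.reassembly.pop(in_vci):
                self.output.append((self.out_vci, payload, flag))
```

Cells of different sources may arrive interleaved, but each packet leaves the merging point as one contiguous run of cells on the outgoing VCI.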

The simulation models for VC and VP merging are shown in Figures 3 and 
4, respectively. Cells belonging to the various VCs are transferred from the 
head of the switch input queues into the shared buffer and, subsequently, to 
the corresponding output queues. It is assumed that the traffic is uniform, i.e. 
the destination of an arbitrary packet can, with an equal probability, be any 
of the output ports, and that successive packets are independent regarding 
their output port destinations. Owing to traffic symmetry, all of the output 
queues have identical behavior. Let us turn our attention to a particular 
output queue and study its behavior. The corresponding simulation model 
considers N sources feeding the output queue in a round-robin fashion governed 
by the factor k. This model is also appropriate for the case where the 
switch fabric is capable of transferring only a limited number of cells to any 
given output (Oie et al. 1989). 

Figure 3 Our simulation model for VC merging. 

Figure 4 Our simulation model for VP merging (figure label: arrival processes). 



2.2 Traffic Model 

The traffic and simulation model we use is shown in Figures 3 and 4. We use 
N arrival processes, which correspond to the traffic destined to the output 
queue. Packets are assumed to arrive according to either a Poisson process 
(nonbursty traffic with the mean arrival rate λ) or a hyperexponential process 
(bursty traffic). The hyperexponential process is generated by a two-stage 
hyperexponential distribution. The mean values corresponding to the two stages 
are 0.51 * λ and 16.48 * λ, respectively. The corresponding routing probabilities 
for the two stages are 0.97 and 0.03, respectively, so that the mean arrival 
rate is again equal to λ. Each packet is assumed to contain a number of cells 
geometrically distributed with a mean of E cells (Chao and Smith 1992, Widjaja 
and Elwalid 1997). We used E = 10, E = 30, or E = 180 cells (10 cells 
correspond to 472 bytes, 30 cells to 1432 bytes and 180 cells to 8632 bytes). 
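
The bursty arrival process and the packet length distribution can be sketched as follows. One reading that makes the stated numbers consistent, and which is an assumption of this sketch, is that the factors 0.51 and 16.48 are the stage mean interarrival times in units of the mean interarrival time of the Poisson case: 0.97 * 0.51 + 0.03 * 16.48 is about 0.989, so the overall mean arrival rate stays approximately unchanged. It is also assumed that the geometric distribution starts at one cell. All names are illustrative.

```python
import random

P_STAGE = (0.97, 0.03)        # routing probabilities of the two stages
MEAN_FACTOR = (0.51, 16.48)   # stage means in units of 1/lam (assumed reading)


def h2_mean_interarrival(lam: float) -> float:
    """Mean interarrival time of the two-stage hyperexponential process."""
    return sum(p * f for p, f in zip(P_STAGE, MEAN_FACTOR)) / lam


def h2_interarrival(lam: float, rng: random.Random) -> float:
    """Draw one interarrival time: pick a stage, then an exponential time."""
    factor = MEAN_FACTOR[0] if rng.random() < P_STAGE[0] else MEAN_FACTOR[1]
    return rng.expovariate(lam / factor)  # exponential with mean factor / lam


def geometric_cells(mean_cells: float, rng: random.Random) -> int:
    """Packet length in cells, geometric on 1, 2, ... with the given mean."""
    p = 1.0 / mean_cells
    cells = 1
    while rng.random() >= p:
        cells += 1
    return cells
```

Most interarrival times come from the short stage, while the rare long stage produces the gaps between bursts.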

It is shown by Widjaja and Elwalid (1997) that the mean packet size in 
a core network where ATM is likely to be applied is about 289 bytes. This 
yields 6.2 ATM cells of data using AAL 5 with the null encapsulation method 
as described by Heinanen (1993) (additional overhead of AAL 5 is 8 bytes per 
packet). The dominant packet sizes in an Internet backbone are 40 or 44 bytes 
at about 36% of the traffic (TCP acknowledgment packets, TCP control seg- 
ments such as SYN, FIN, . . . , and Telnet packets carrying single characters), 
552 or 576 bytes at about 25% (512 and 536 bytes of TCP implementations 
without path MTU discovery as the default maximum segment size (MSS) for 
nonlocal IP destinations, yielding a 552 or 576-byte packet size), 185 bytes at 
about 2.7%, and 1500 bytes at about 1.5% (Ethernet traffic). These statistics 
were collected on Feb 10, 1996, in the FIX-West network as a sample wide-area 
network, and are given on the NLANR homepage (1996). A more recent study 
of traffic characteristics in an Internet backbone was conducted in August of 
1997 (Thompson et al. 1997). It is shown that almost 50% of the traffic is 40 
or 44 bytes in packet length. More prominent packet sizes are 532, 576, and 
1500 bytes, each representing 15% of the traffic. Comparing the two studies 
we observe a shift to smaller packets of size 40 or 44 bytes and larger packets 
of size 1500 bytes. 
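
The relation between packet sizes in bytes and in cells used above follows from the 48-byte cell payload and the 8 bytes of AAL 5 overhead per packet. The sketch below assumes the null encapsulation mentioned in the text, i.e. no further header; the function names are illustrative. It also shows the whole number of cells a packet occupies, since the last cell is padded.

```python
CELL_PAYLOAD = 48   # payload bytes per ATM cell
AAL5_OVERHEAD = 8   # AAL 5 overhead in bytes per packet (as stated in the text)


def max_packet_bytes(cells: int) -> int:
    """Largest packet that fits into the given number of cells."""
    return cells * CELL_PAYLOAD - AAL5_OVERHEAD


def mean_cells(packet_bytes: float) -> float:
    """Cells of data for a (mean) packet size, without padding."""
    return (packet_bytes + AAL5_OVERHEAD) / CELL_PAYLOAD


def cells_needed(packet_bytes: int) -> int:
    """Whole cells occupied by one packet, including padding."""
    return (packet_bytes + AAL5_OVERHEAD + CELL_PAYLOAD - 1) // CELL_PAYLOAD


if __name__ == "__main__":
    for size in (40, 44, 576, 1500):
        print(size, "bytes ->", cells_needed(size), "cells")
```

A 40-byte packet fits exactly into one cell, whereas a 44-byte packet already needs two.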

For the future development of packet sizes, the spreading of the use of 
path MTU (PMTU) discovery will have a significant impact. PMTU will 
affect MTUs in IPv4 as proposed by Mogul and Deering (1990) and even 
more MTUs in IPv6 over faster LANs. There will be numerous packets with 
possible sizes up to 64 kilobytes (max. packet format for AAL 5 is 64 kilobytes 
(Laubach 1994)). A single packet of this size involved in reassembly could 
alone fill the entire reassembly buffer in a switch output queue. Atkinson 
(1994) gives an overview of other typical frame sizes being applied on AAL 5. 
These are 8 kilobytes used by the Network File System (NFS) and the 9180 
bytes of IP MTU over SMDS (Piscitello and Lawrence 1991) that became the 
default value for IP MTU over ATM AAL 5 (Laubach 1994). These big packet 
sizes in conjunction with VC merging could cause problems that VP 
merging would not encounter. On the other hand there will also be much 
more real-time traffic (e.g. voice) in the Internet. Real-time traffic typically 
produces a large number of very small packets. 

3 SIMULATION SETUP AND RESULTS 

This study concentrates on the additional buffer space required for reassembly. 
It is conducted under loads l = 30%, 70%, 90%, with different traffic charac- 
teristics (bursty, nonbursty), factors k = [1, 16], N = 16, 64, 128 sources, M 
ports ( M > N ), and with a mean packet size of E = 10, 30, 180 cells. The default 
values for the simulations are N = 16 sources, E = 10 cells, l = 90%, and 
k = 16 unless specified otherwise. Figures 5-13 show the VC merging buffer 
size (solid line) and the corresponding VP merging buffer size (dashed line). 
The results serve to compare VP and VC merging. They cannot, however, 
be used directly to show the required output buffer space in an ATM switch 
because no flow control has been taken into account. The simulations were 
carried out for an extremely large number of events such that 95% confidence 
intervals were very small. 

Figure 5 shows the results for l = 30%, 70%, 90%. The difference between 
the solid and the dashed lines (VC and VP merging for a specific load) is 
about 19 to 21 cells for l = 90%, about 21 to 23 cells for l = 70%, and about 
30 to 37 cells for l = 30% over some magnitudes of overflow probability. At 
high loads the output queue contains a large number of cells, which translates 
to long delays. Therefore, by the time the first cell of a packet is ready for 
transmission at the output link, the corresponding last cell has most likely 
arrived and, consequently, the packet reassembly has been completed. In this 
case, therefore, the additional overhead due to reassembly is almost negligible. 
It is also important to note that the workload of today’s switches normally 
lies at high levels of around 70% or 80%. In contrast, at low loads, the first cell 
may be ready for transmission while the reassembly is in progress. In this case 
it has to be delayed until the reassembly process has been completed. However, 
owing to the low load, the number of packets under reassembly is small 
and, therefore, the additional buffer requirement of VC merging is minimal. 
The results obtained are in agreement with those presented by Widjaja and 
Elwalid (1997). Furthermore, the packet delay corresponding to VC merging 
approaches that corresponding to VP merging as the load increases. 

Figure 5 Simulations for l = 30%, 70%, 90%, nonbursty arrival process. 

We then made the same simulations with bursty arrival processes. We model 
the bursty arrival process by a hyperexponential packet arrival process as 
described in the previous section. The results for l = 30%, 70%, 90% are shown 
in Figure 6. We see that the buffer requirements for both VC and VP merging 
grow significantly for high loads. Of course flow control would alleviate this 
problem to some extent due to the overall load reduction. The additional 
buffer requirements for VC merging compared to VP merging are minimal 
even for the case of bursty traffic. In particular, for high loads they become 
negligible for the reasons given above. 

Simulation results were obtained for different values of k and different loads 
l. By varying k we expected to see an influence on the additional buffer requirement. 
Surprisingly, only the extreme value k = 1 resulted in a big additional 
buffer requirement for VC merging. It is obvious that VP merging requires 
almost no output buffer with k = 1 as the speed of the switch output port 
(Rs) is equal to the speed of the output link (Rl). We then tried to determine 
the critical k for every load factor l considered. The critical value of k 
is defined as follows: For all values of k larger than the critical value, there is 
practically no distinction between VC curves and VP curves, whereas for all 
smaller values of k the curves start becoming distinguishable. We found that 
the critical k lies close to the extreme value k = 1. The range of the critical 
k is between 1.1 and 1.3 for l = 90% and l = 70%, respectively. This means 
that the critical k becomes larger with lower loads, but it is still far away from 
the values implemented in today’s switches (greater than 2). To substantiate 
these observations we investigated the critical k for l = 30%, too. In this case 
the critical value for k is approximately 1.5, which is still much smaller than 
2 and thus confirms our theory. 

Figure 6 Simulations for l = 30%, 70%, 90%, bursty arrival process. 

Figure 7 shows the results of our simulations for k = 1.1, 1.2, 16 at l = 90%. 
We observe that all of the curves for the output buffer of VC merging at 
different values of k lie close together. The value of k = 1.1 is the critical one 
because the corresponding curve starts to show a deviation. The same applies 
to the curves for VP merging. For values of k greater than the critical one, 
the additional buffer requirement for VC merging at low overflow probabilities 
is minimal. However, the difference between VC and VP merging becomes 
noticeable for values of k less than the critical value. 




Figure 7 Simulations for l = 90%, k = 1.1, 1.2, 16, bursty arrival process. 

Figure 8 shows the results of our simulations for k = 1.2, 1.3, 16 at a lower 
load of l = 70%. In this case, we observe that the different curves for VC 
and VP merging lie fairly close together with the exception of the curves for 
k = 1.2. This shows that the critical k is slightly larger for l = 70% (about 
k = 1.2) than for l = 90% (about k = 1.1). Here again, the additional buffer 
requirements for VC merging at low overflow probabilities become noticeable 
for values of k less than the critical value. 

Figure 9 shows the results of our simulations for k = 1.5, 2, 16 at a low load 
of l = 30%. We observe again the similarity of the curves for VC merging 
over the entire range of k . The curves for VP merging vary slightly so that 
the additional buffer space becomes smaller for a larger k , with a critical 
k at about k = 1.5. There is a noticeable additional buffer requirement for 
VC merging in the entire range of values of k. Furthermore, the additional 
requirement increases as k decreases. 

Figure 8 Simulations for l = 70%, k = 1.2, 1.3, 16, bursty arrival process. 

Figure 9 Simulations for l = 30%, k = 1.5, 2, 16, bursty arrival process. 

We then tried to investigate the possible influence of more specific traffic 
characteristics such as larger packet sizes and increased numbers of sources in a 
mpt-to-pt connection on the additional requirements of VC merging compared 
to VP merging. First, we performed simulations for a larger mean packet size 
of the arrival process ( E = 30). Figure 10 shows the curves for l = 90% and 
k = 1.1, 1.2, 16 with a mean packet size of E = 30. Compared to Figure 7 we 
observe a greater difference between the curves for k = 1.1 and for k = 1.2. 

Figure 10 Simulations for l = 90%, k = 1.1, 1.2, 16, bursty arrival process 
and E = 30. 



It appears that the critical k is shifted to a value slightly larger than k = 1.1 
(between k = 1.1 and k = 1.2). Furthermore we see that the mean packet size, 
which is three times larger, requires an output buffer size that is also three 
times larger. Moreover, the additional output buffers for VC merging are about 
three times larger for E = 30. Therefore the additional buffer requirement for 
VC merging appears to grow linearly with the mean packet size. This trend 
is also verified by our simulations for E = 180. 

Figure 11 shows the results of the same simulation for l = 70%, k = 
1.2, 1.5, 16 and an increased mean packet size of E = 30. Compared to Figure 
8 we again observe a shift of the critical k from a value of about k = 1.2 to 
a slightly larger value. Concerning the additional buffer requirement for VC 
merging, the same observations were made as in the case of load l = 90%. 
This means that, also at this load, as the packet size increases, the additional 
buffer requirement increases by the same factor. 

Figure 11 Simulations for l = 70%, k = 1.2, 1.5, 16, bursty arrival process 
and E = 30. 

Finally we performed simulations for a larger N (N = 64, 128) to assess the 
influence of a large number of sources associated with one mpt-to-pt connection 
on the additional buffer requirements of VC merging due to reassembly. 
An increased number of sources could translate to an increased degree of reassembly. 
This again would lead to a significantly larger required buffer space 
for reassembly than for nonreassembly. Figure 12 shows the results for the 
simulations for N = 16, 64 sources, factor k = 16 and nonbursty traffic. The 
results obtained also apply in the case of bursty traffic. This is explained 
by the Palm-Khintchine theorem (Heyman and Sobel 1982, p. 156), which states 
that summing up a large number of iid processes (for instance hyperexponential 
processes as used for our bursty traffic) results in a process of Poisson 
type (our nonbursty traffic). As our simulation has 64 sources, each with an 
iid process for the arrival traffic, we are able to apply this theorem and to 
simulate nonbursty arrival traffic. All of the corresponding curves in Figure 
12 lie close together. Concerning VP merging, as N increases, the corresponding 
curves converge because the aggregated arrival process tends to a Poisson 
one. For VC merging, the buffer requirement does not increase with the number 
of sources. This is because increasing the number of sources translates to 
decreasing the arrival rate per source such that the load at the output link 
remains constant. This shows that our previous simulation results hold also 
for a larger scenario with a larger number of senders. 

Figure 12 Simulations for N = 16, 64, l = 70%, k = 16, nonbursty arrival 
process. 
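
The tendency towards a Poisson-type process can be observed in a small experiment. The sketch superposes N independent hyperexponential sources whose total rate is kept constant and reports the squared coefficient of variation (SCV) of the merged interarrival times, which equals 1 for a Poisson process. It is only an indicator, not a proof, and it assumes that the stage factors 0.51 and 16.48 are mean interarrival times relative to the Poisson case. All names are illustrative.

```python
import random
from typing import List

P_STAGE = (0.97, 0.03)        # routing probabilities of the two stages
MEAN_FACTOR = (0.51, 16.48)   # stage means relative to 1/rate (assumed reading)


def h2_gap(rate: float, rng: random.Random) -> float:
    """One hyperexponential interarrival time for a source of given rate."""
    factor = MEAN_FACTOR[0] if rng.random() < P_STAGE[0] else MEAN_FACTOR[1]
    return rng.expovariate(rate / factor)


def merged_gaps(n_sources: int, horizon: float, rng: random.Random) -> List[float]:
    """Interarrival times of the superposition of n_sources iid sources,
    each of rate 1/n_sources, observed over [0, horizon)."""
    arrivals: List[float] = []
    for _ in range(n_sources):
        t = h2_gap(1.0 / n_sources, rng)
        while t < horizon:
            arrivals.append(t)
            t += h2_gap(1.0 / n_sources, rng)
    arrivals.sort()
    return [b - a for a, b in zip(arrivals, arrivals[1:])]


def scv(values: List[float]) -> float:
    """Squared coefficient of variation: variance divided by mean squared."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean * mean)


if __name__ == "__main__":
    rng = random.Random(7)
    for n in (1, 16, 64):
        print(n, round(scv(merged_gaps(n, 100000.0, rng)), 2))
```

A single source shows an SCV far above 1, whereas the SCV of the merged stream of 64 sources comes close to 1.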




Figure 13 Simulations for N = 64, l = 70%, k = 2, 4, 16, nonbursty arrival 
process. 

We have investigated the impact of varying k given large values of N. Pre- 
viously we found a critical k of about 1.1 to 1.3 at N = 16. Figure 13 shows 
the results of the simulations with N = 64 and k = 2, 4, 16 with nonbursty 
traffic. The different curves for VC and VP merging again lie close together 
and we see no significant difference between the curves belonging to the val- 
ues k = 2 and k = 16. Consequently the critical value of k is smaller than 
2. Once again, for values of k greater than the critical value, the additional 
buffer requirement for VC merging does not increase. 



4 SUMMARY AND CONCLUSIONS 

VC merging is likely to become the method of choice to implement mpt- 
to-pt connections in ATM networks. Because of the cell interleaving problem 
created by VC merging, reassembly has to be performed in the merging points. 
The effect of reassembly has been investigated assuming an output queue 
switch architecture. The results obtained demonstrate that, at high loads 
and for arbitrary arrival processes, the implementation of VC merging in 
the switches will not require much additional buffer at the output queues of 
the switches. In contrast, at low loads, additional buffer is required but this 
is minimal. Furthermore, it was found that the additional buffer requirement 
for VC merging is proportional to the average packet size. Consequently, large 
packet sizes can result in large reassembly buffer requirements. We further 
investigated the effect of the speed ratio between switch output port and 
output link and came to the conclusion that for sufficiently large speed ratio 
values (k > 2) the output buffer requirements for VP and VC merging each remain 
essentially unchanged as k increases. We found a critical k which grows with decreasing 
utilization and also with growing mean packet sizes of the arrival traffic. But 
it always remains between 1.1 and 1.3 for high utilization of 70% and 90%. 

Abstract 

Multi-layer forwarding approaches (a.k.a. multi-layer switching or routing) that 
use ATM as transport technology have proven not to scale sufficiently unless route 
aggregation is performed. In ATM networks route aggregation implies stream 
merging: cells from different incoming streams are switched to the same outgoing 
link and labeled with the same stream identifier. This identifier could be either the 
whole VPI/VCI pair (VC merging) or only the VPI (VP merging). Stream merging 
approaches are quite often referred to as VC merging approaches and in this paper 
we follow this naming convention. The standard way of carrying IP over ATM 
exploits the ATM Adaptation Layer 5 (AAL5) which does not provide native 
support for VC merging. 

This paper provides an overview of the VC merging problem and presents a 
review of the most common solutions proposed so far. It also presents CLIMAX, a 
solution that could fit in different scenarios to solve the VC merging problem. 

Keywords 

Multi-layer routing, stream merging, VC merging, ATM, Multi Protocol Label 
Switching. 

1 INTRODUCTION 

Multi-layer routing techniques are quite often conceived for generic layer 2 and 
layer 3 protocols; however, their most natural application seems to be the 
combination of ATM (Asynchronous Transfer Mode) and the TCP/IP protocol 
suite in order to benefit from the performance of the former and the well-known 
properties of the latter. Multi-layer routing techniques have proven not to be 
sufficiently scaleable [SCALE] if the ATM network does not allow Virtual 
Connections (VCs) to be merged, i.e., cells from different incoming VCs to be 
switched to the same outgoing link using the same Virtual Path Identifier/Virtual 
Channel Identifier (VPI/VCI) pair. This capability, known as VC merging, allows 
multipoint-to-point VCs to be implemented. VC merging is crucial also to the 
implementation of scaleable group multicast over ATM, since it requires 
multipoint-to-point VCs too. This paper focuses on the application of VC merging 
to multi-layer routing, but most of the considerations drawn also apply to the 
exploitation of VC merging in ATM multicast. 

In fact, VC merging is not the only possible solution to improve scalability in 
multi-layer routed (or switched) networks. This paper describes many solutions 
that improve network scalability by performing VP merging instead of VC merging. 
Thus, Stream Merging would be the best term to address the whole set 
of techniques; nevertheless, since the term VC merging has been traditionally used, 
we follow this naming convention in the paper. 

VC merging cannot be performed by common ATM switches when higher layer 
packets are transmitted using the services provided by the ATM Adaptation Layer 5 
(AAL5) [ITU_AAL], as recommended by most of the proposals for carrying both 
data and multimedia traffic over ATM networks. In fact, AAL5 relies on the ATM 
layer delivering all the cells over a VC in the same order they were sent and 
without misinserted cells. Instead, when VCs are merged by a switch, cells 
belonging to different VCs get mixed together and are not distinguishable any 
more. Many different approaches for supporting VC merging have been proposed 
so far, but none of them has yet proven to be the best in every situation. Each of 
them is particularly suitable for a specific network environment and for specific 
needs. In Section 2 the proposals that have appeared so far are briefly described 
and then compared. Conclusions are drawn in Section 3. 

2. REVIEW 

This section first provides a framework for classifying VC merging approaches and 
then describes their advantages and drawbacks, highlighting: 

• hardware and software changes required in both core ATM switches and 
devices at the edge of the network; 

• performance in terms of delay, jitter, throughput, and buffer requirements; 

• specific problems. 

A comparison of the different approaches is presented in Section 2.11. 

2.1 Classification of VC merging approaches 

Various VC merging approaches have been proposed in the recent past; some of 
them present many similarities and they can be broadly classified into three 
categories according to the philosophy adopted to solve the problem, as shown in 
Figure 1: 

• approaches based on avoiding cell interleaving; 

• approaches based on VP switching; 

• approaches based on AAL5 modification. 

Approaches based on avoiding cell interleaving cause intermediate switches not 
to forward cells belonging to different packets simultaneously on the same output 
VC. All the cells belonging to the same packet are gathered and then forwarded all 
together. Approaches based on VP switching adopt the VCI to identify the packet 
to which a cell belongs. Finally, approaches based on AAL5 modification 
introduce an identifier in the cell payload and use it in order to discriminate among 
cells carrying different packets and traveling on the same VC. The last two 
classes of approaches can be further subdivided into two categories, according to 
whether the identifier is associated with a packet or with a sender. 

Figure 1 - A classification of VC merging approaches. 



2.2 MPLS Proposal 

The IETF’s Multiprotocol Label Switching (MPLS) working group proposes to 
avoid cell interleaving [MPLS ARCH]. ATM switches are modified to implement a 
special queuing policy for incoming cells traveling on merged VCs. Each switch 
queues all the cells belonging to a packet until a cell with the End Of Message 
(EOM) bit¹ set (an EOM cell, for short) is received. This indicates that a whole 
packet has been completely received and buffered. Then, all the cells are 
transferred to the output queue for transmission. This mechanism prevents cells 
belonging to different packets from being interleaved on the output link. 



¹ The EOM bit is set by a transmitting AAL5 entity to identify the last cell of a packet. 

Figure 2 - Cell buffering in the MPLS approach. 



Figure 2 schematically shows the behavior of a switch implementing the MPLS 
approach. In Figure 2(a) no packet is being forwarded on the merged (outbound) 
VC, because none of the input buffered packets has been completely received. 
Cells belonging to incoming packets are being queued until their EOM cell is 
received. When the EOM cell of the gray packet is received, all the cells of the 
gray packet are transferred to the output queue at once, as shown in Figure 2(b). 
Note that even if some cells of the white and black packets have reached the switch 
before the gray ones, they will wait in input queues until their EOM cell is 
received, i.e., the whole white packet will be transmitted after the gray one. 
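
The buffering behavior just described can be sketched in a few lines of Python. This is only an illustrative model, not the MPLS specification: the class name `StoreAndForwardMerger`, the string VC names, and the tuple used to represent a cell are all assumptions made for the example.

```python
from collections import defaultdict, deque


class StoreAndForwardMerger:
    """Hypothetical merge point: buffers cells per incoming VC and
    releases a packet to the merged VC only when its EOM cell arrives."""

    def __init__(self):
        # incoming VC -> cells of the packet currently being received
        self.input_queues = defaultdict(list)
        # cells ready for transmission on the merged (outbound) VC
        self.output_queue = deque()

    def receive(self, in_vc, payload, eom):
        """Queue one cell; on an EOM cell move the whole packet to the output."""
        self.input_queues[in_vc].append((in_vc, payload, eom))
        if eom:
            self.output_queue.extend(self.input_queues.pop(in_vc))


merger = StoreAndForwardMerger()
# Cells of the white and black packets arrive first, but gray completes first.
merger.receive("white", "w1", False)
merger.receive("black", "b1", False)
merger.receive("gray", "g1", False)
merger.receive("gray", "g2", True)
merger.receive("white", "w2", True)
order = [payload for _, payload, _ in merger.output_queue]
print(order)
```

In this run the gray packet is transmitted before the white one although white cells arrived earlier, and the black cell stays in its input queue because its EOM cell has not been received.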

AAL5 is not modified and ATM switches are not required to parse the cell 
payload. Even though connection endpoints do not need any change, this approach 
modifies the forwarding paradigm of switches and this, in turn, implies hardware 
modifications in ATM switches. Messages are not forwarded cell by cell and thus 
switches do not feature the latency properties characterizing ATM. However, since 
packets are not required to be completely reassembled, the MPLS approach 
demands less processing and introduces shorter latency than packet forwarding at 
intermediate switches. The extra buffer capacity and the per packet queuing 
needed in ATM switches could limit scalability. 

2.3 Simple and Efficient ATM Multicast 



Simple and Efficient ATM Multicast (SEAM) [SEAM] is very similar to the MPLS 
solution in buffering incoming cells until the EOM cell is received. Nevertheless, it 
aims at increasing throughput by forwarding cells immediately (before arrival of 
the EOM cell) when the output link is idle (cut-through). Figure 3 shows how 
SEAM works; the output queue being empty, the switch immediately forwards the 
cells of the first packet it began receiving, i.e., the cells of the white packet in 
Figure 3(a). This prevents cells belonging to other packets from being forwarded; 
as shown in Figure 3(b), even if the EOM cell of the black packet is received, it 
waits in the input queue until the EOM cell of the gray packet has been moved to 
the output queue. 

Figure 3 - SEAM approach to VC merging 



If the EOM cell of the packet being switched gets lost, cells belonging to other 
packets are blocked waiting for that EOM cell. A timer is used to overcome this 
problem; its duration is crucial and significantly impacts the buffer requirements 
of switches. It should be determined on the basis of the bandwidth of the merged VC, 
the capacity of the links, and the load in the network. 

SEAM shares most of the MPLS characteristics and it is not clear whether 
performance is really improved. Cut-through forwarding can imply longer latency 
for short packets (such as TCP acknowledgment messages) when a long packet is 
being forwarded. Furthermore, it is hard to determine a suitable timer value since 
it depends on many parameters; if it is too short, valid packets could be 
discarded during congestion, and if it is too long, it could seriously affect buffer 
requirements, latency and jitter. 
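
A minimal sketch of cut-through merging may help to see why short packets can be delayed. It is a simplified model under stated assumptions: the class `CutThroughMerger` and its attributes are hypothetical, the output link is considered idle whenever no packet is being cut through, and the EOM-loss timer discussed above is deliberately left out.

```python
from collections import deque


class CutThroughMerger:
    """Hypothetical SEAM-like merge point without the EOM-loss timer."""

    def __init__(self):
        self.active_vc = None    # VC whose packet is being cut through
        self.pending = {}        # in_vc -> cells buffered for a packet in progress
        self.complete = deque()  # fully buffered packets waiting for the link
        self.output = []         # cells sent on the merged VC, in order

    def receive(self, in_vc, payload, eom):
        if self.active_vc is None:
            # Output idle: this packet is forwarded cell by cell from now on.
            self.active_vc = in_vc
        if in_vc == self.active_vc:
            # Send anything buffered earlier for this VC, then the new cell.
            self.output.extend(self.pending.pop(in_vc, []))
            self.output.append(payload)
            if eom:
                self._release()
        else:
            # Another packet holds the merged VC: buffer until it ends.
            self.pending.setdefault(in_vc, []).append(payload)
            if eom:
                self.complete.append(self.pending.pop(in_vc))

    def _release(self):
        """The active packet ended: send the packets completed meanwhile."""
        self.active_vc = None
        while self.complete:
            self.output.extend(self.complete.popleft())


seam = CutThroughMerger()
seam.receive("white", "w1", False)   # cut through immediately
seam.receive("gray", "g1", False)
seam.receive("black", "b1", False)
seam.receive("black", "b2", True)    # complete, but must wait for white
seam.receive("gray", "g2", True)
seam.receive("white", "w2", True)
print(seam.output)
```

Here the short black packet is complete well before the white one ends, yet it is transmitted only after the EOM cell of the white packet, which is the latency penalty mentioned above.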

2.4 Cell Re-labeling at Merge-points 

CRAM (Cell Re-labeling at Merge-points) is essentially an optimization of SEAM. 
It introduces a new type of Resource Management (RM) cell, which carries the 
multiplexing information. When cells are received from more than one input link 
for the same merged VC, switches suspend forwarding on the merged VC until a 
reasonable number of cells has been gathered. Then, the cells that arrived from 
different sources are sent on the outgoing link without interleaving them, i.e., 
grouped according to their source². The trains of cells are preceded by an RM cell 
which contains a list of Source IDentifiers (SIDs) uniquely identifying the sender 
of each following group of cells. SIDs are listed in the same order as the 
corresponding groups of cells on the merged VC. 

Two mechanisms to guarantee SID’s uniqueness are envisioned: 

• Globally Unique SID Allocation: SIDs are uniquely assigned by a central 
server or through a distributed mechanism. 

• Dynamic SID Allocation: network nodes dynamically remap the SIDs so that 
only local uniqueness is required. 

CRAM is not compatible with current core and edge devices, even if it does not 
require any actual modification to the AAL. It requires some minimal changes in 
switches in order to cope with the new type of RM cell and implement its specific 
queuing mechanisms. Moreover, some mechanism must be used to cope with the 
assignment of unique SIDs. Finally, support for Early Packet Discard (EPD) 
must be re-implemented because it requires parsing the RM cell payload. 

2.5 Improved VP Switching and Merging 

The improved (or extended) VP switching approach [VPMERGE] has been proposed by 
the ATM Forum and can be categorized among the VP switching based approaches. It 
consists of merging ATM Virtual Paths (VPs). Cells belonging to packets coming 
from different sources are discriminated through a VCI uniquely assigned to each 
source. Improved VP switching features all the characteristics of ATM cell 
switching, thus allowing resource reservation and cell scheduling policies to be 
kept unchanged without introducing additional delay at the (VP) merging points. 



² The boundary of a group does not have to coincide with a packet boundary; a group can 
contain cells belonging to one or more packets. 








Sources must be provided with a method for identifying a unique VCI value 
which is chosen at connection setup. At least two different categories of VCI 
assignments could be identified: 

• Server-based: a central server in the network is responsible for the assignment 
of unique VCIs. 

• Signaling-based: VCIs are negotiated by neighbor nodes. 

VP merging presents the disadvantage of using a scarce resource, namely VP 
space, which limits the maximum number of merged connections on the same link. 
To overcome this problem, the improved VP switching approach proposes to 
enlarge the VPI field (18 bits) at the expense of a smaller VCI field (10 bits); the 
total cell label length is kept unchanged. This is not compatible with the standard 
operation of ATM switches and every cell must contain an indication of whether 
the switches should use the long or the short VPI field. The most significant bit of 
the VPI field is used to provide such an indication, thus halving the available VPI 
space. Moreover, implementation of improved VP switching requires ATM 
switches to be modified in order to cope with the new partitioning of the VPI/VCI 
field. 



2.6 Dynamic IDentifier Assignment 

The Dynamic IDentifier Assignment (DIDA) approach [DIDA] is similar to the 
Improved VP Switching technique and it also comes from the ATM Forum. DIDA 
does not require packet reassembly at intermediate switches or usage of globally 
coordinated identifiers. DIDA assigns to each message a locally unique identifier 
which is inserted in the VCI field. Cells are routed according to their VPI, and the 
VCI is changed by each switch. The switch identifies any new VCI on incoming 
cells as the beginning of a new message and assigns a new locally unique VCI to 
the cell when it is transmitted on the outgoing port. 

There are two differences between DIDA and Improved VP Switching: 

• In the DIDA approach the VPI space remains unmodified and is consequently 
smaller than in improved VP switching. 

• VCI semantics and assignment are different. In Improved VP Switching the 
VCI identifies the source of the cell, while in DIDA it identifies a packet (i.e., 
packets generated by the same source can have different identifiers). 

According to DIDA each identifier is assigned to a message only while it is 
traveling, thus requiring a small identifier space and no global uniqueness of VCIs. 
Like Improved VP Switching, DIDA requires some modification to ATM 
switches, which must modify the VCI in each cell even though they do not use it 
for routing the cell. The number of merged connections across each port is limited 
to 4096, because the VPI field is not extended. 
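
The per-hop VCI remapping can be illustrated with the following sketch. It is not taken from the DIDA proposal: the class `DidaOutputPort`, the tiny default identifier space, and the choice of releasing an identifier when the EOM cell is forwarded are assumptions made for the example (reserved VCI values are ignored as well).

```python
from collections import deque


class DidaOutputPort:
    """Hypothetical output port assigning a locally unique VCI per message."""

    def __init__(self, vci_space=16):
        self.free_vcis = deque(range(vci_space))  # identifiers not in use
        self.mapping = {}                         # (in_port, in_vci) -> out_vci

    def forward(self, in_port, in_vci, eom):
        """Return the outgoing VCI for a cell; a new incoming VCI starts a message."""
        key = (in_port, in_vci)
        if key not in self.mapping:
            if not self.free_vcis:
                raise RuntimeError("no free VCI on this port")
            self.mapping[key] = self.free_vcis.popleft()
        out_vci = self.mapping[key]
        if eom:
            # Assumption: the identifier is released once the message has passed.
            del self.mapping[key]
            self.free_vcis.append(out_vci)
        return out_vci


port = DidaOutputPort(vci_space=4)
first = port.forward(in_port=1, in_vci=7, eom=False)
second = port.forward(in_port=2, in_vci=7, eom=False)  # same VCI, other port
again = port.forward(in_port=1, in_vci=7, eom=False)
last = port.forward(in_port=1, in_vci=7, eom=True)     # message ends here
third = port.forward(in_port=3, in_vci=9, eom=False)
print(first, second, again, last, third)
```

Two messages that arrive with the same VCI on different input ports leave with different VCIs, so the next hop can still tell their cells apart.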

2.7 Double Identification Label Swapping 

Similarly to DIDA, the Double Identification Label Swapping (DILS) approach 
from IETF [DILS] uses a double level of identification for each packet. The first 
level identifies the destination and the second the source. DILS envisages three 
options for the location of the identifiers: 

1. The VPI identifies the destination and the VCI the source; the network 
performs VP switching. 

2. One half of the VPI/VCI space is used to identify the source and the other half 
the destination; switches route cells based on the second half. 

3. The VPI/VCI identifies the destination and the source identifier is placed in the 
cell payload; switches do not require any modification since routing is based on 
VPI/VCI. 

DILS needs an auxiliary protocol to assign source identifiers; hardware changes 
are needed only with options 2 and 3, listed above. Software changes will be 
needed when implementing DILS according to option 1. Options 2 and 3 show 
higher scalability than option 1, because of the larger labeling space available. 

Performance is quite similar to that of Improved VP Switching and DIDA; 
cell switching is performed with neither extra delay introduced nor extra buffer 
capacity required. 

2.8 The Sink Tree Paradigm 

The sink tree paradigm [SINKTREE] is an innovative approach for ATM Local 
Area Networks (LANs) which is strongly based on VC Merging. Every switch in 
the LAN is the root of a multipoint-to-point VC (a sink tree) connecting it to all the 
other switches. A set of sink trees provides full connectivity among switches. 
Special cells called connectionless cells are transmitted over sink trees; they are 
differentiated according to a bit in the VPI field and are handled differently than 
regular ATM cells traveling over ordinary VCs. When a source host transmits 
connectionless cells carrying a packet to a destination host, the source switch 
places these cells on the sink tree associated with the switch of the destination host. 
The VPI/VCI fields of connectionless cells carry (1) the source and destination 
switch identifiers in order to identify the sink tree over which cells must travel, 
(2) the destination host identifier in order to allow the destination switch to deliver 
the cells to the proper host, and (3) the source host identifier. The latter enables the 
destination host to distinguish the cells coming from different sources and properly 
reassemble them even if many sources simultaneously transmit cells to the same 
destination host and they get interleaved while traversing the sink tree. 

Storing all this information in the VPI/VCI fields limits the scalability of the 
approach. In fact, the length of the source and destination switch identifiers is 8 
bits, and the length of source and destination host identifiers is 5 bits. This means 
that the largest LAN can span up to 256 switches, each having up to 32 hosts 
directly connected, i.e., the maximum number of hosts allowed in a LAN is 8192. 
These numbers sound quite reasonable in a LAN environment, but prevent the 
scheme from being exploited in a wide area network. 
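
The arithmetic behind these limits can be checked with a short sketch. The field widths come from the text above; the order in which the four identifiers are packed into the label and the function names are assumptions made only for illustration.

```python
SWITCH_BITS = 8  # source and destination switch identifiers
HOST_BITS = 5    # source and destination host identifiers

max_switches = 1 << SWITCH_BITS
hosts_per_switch = 1 << HOST_BITS
max_hosts = max_switches * hosts_per_switch


def pack_label(src_switch, dst_switch, src_host, dst_host):
    """Pack the four identifiers into one integer (hypothetical field order)."""
    fields = ((src_switch, SWITCH_BITS), (dst_switch, SWITCH_BITS),
              (src_host, HOST_BITS), (dst_host, HOST_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    label = dst_switch
    label = (label << SWITCH_BITS) | src_switch
    label = (label << HOST_BITS) | dst_host
    label = (label << HOST_BITS) | src_host
    return label


def unpack_label(label):
    """Inverse of pack_label; returns (src_switch, dst_switch, src_host, dst_host)."""
    src_host = label & ((1 << HOST_BITS) - 1)
    label >>= HOST_BITS
    dst_host = label & ((1 << HOST_BITS) - 1)
    label >>= HOST_BITS
    src_switch = label & ((1 << SWITCH_BITS) - 1)
    label >>= SWITCH_BITS
    return src_switch, label, src_host, dst_host


label = pack_label(src_switch=3, dst_switch=200, src_host=17, dst_host=4)
print(max_switches, hosts_per_switch, max_hosts, unpack_label(label))
```

The four identifiers occupy 26 bits in total, which leaves very little room in the VPI/VCI fields for anything else.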

The Sink Tree Paradigm requires switches to be modified to route cells based on 
the portion of the VPI/VCI field which identifies the destination switch (i.e., the 
sink tree on which the cell must travel). Moreover, a protocol for building sink 
trees and accordingly configuring the forwarding tables of switches is necessary. 
Edge devices need modifications too since the basic principles for VC creation and 
management have changed. 

Even though the Sink Tree approach keeps the cell based forwarding paradigm 
typical of ATM, it is not suitable for the provision of service guarantees to 
applications in terms of controlled delay and jitter. In fact, switches cannot 
discriminate and properly handle the traffic of a specific application in order to 
provide it with the required quality of service. The finest possible granularity of 
traffic segregation in switches is the source-destination pair. 

2.9 AAL5+ 



In [AAL5+] the VC merging problem is solved through a new AAL slightly 
differing from AAL5. AAL5+ overcomes the problems due to cell interleaving by 
marking all the cells belonging to the same packet with a Message IDentifier 
(MID). Its value is assigned by sources on a per packet basis and it is randomly 
chosen in the range [0, 65535] with a uniform probability distribution. 
Destinations distinguish cells belonging to different packets thanks to the MID 
field and properly reassemble incoming packets, even if their cells got interleaved. 
Since the MID is chosen randomly, two or more messages may have the same 
MID at the same time. If their cells get interleaved the messages are lost because 
the destination cannot discriminate the cells belonging to the various messages. 
This phenomenon is called a MID conflict or a MID collision. MID conflicts are 
shown to be very rare and thus they are not explicitly handled. The upper layers 
detect the corruption of packets affected by a MID collision and discard them. 
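
To give a feeling for how rare MID collisions are, the probability that at least two of n simultaneously interleaved messages pick the same MID can be estimated with the classical birthday computation. This is a simplified static model assumed for illustration (n messages in flight at the same instant, independent uniform MIDs); it is not the analysis carried out in [AAL5+].

```python
import math

MID_SPACE = 1 << 16  # 16-bit MID, values in [0, 65535]


def collision_probability(concurrent_messages, mid_space=MID_SPACE):
    """Probability that at least two messages share a MID (birthday bound)."""
    if concurrent_messages > mid_space:
        return 1.0
    p_all_distinct = math.prod(
        1 - i / mid_space for i in range(concurrent_messages)
    )
    return 1 - p_all_distinct


for n in (2, 10, 50):
    p16 = collision_probability(n)
    p10 = collision_probability(n, mid_space=1 << 10)  # a 10-bit MID for comparison
    print(f"n={n:3d}  16-bit MID: {p16:.6f}  10-bit MID: {p10:.6f}")
```

With two interleaved messages the collision probability is exactly 1/65536; a shorter identifier makes collisions considerably more likely for the same number of messages.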

Figure 4 - AAL5+ SAR-PDU 



The MID field, which is 16 bits long, is placed in the ATM cell payload by the 
Segmentation And Reassembly (SAR) sub-layer of the AAL, as shown in Figure 4. 
Since AAL5+ uses two octets out of the 48 of the ATM cell payload to carry the 
MID, its efficiency is lower than that of AAL5 (i.e., 46/53=86.8% versus 
48/53=90.57%, respectively). The efficiency of AAL5+ is nevertheless higher than 
that of AAL3/4 (83%), which inserts a MID in each cell as well and could in 
principle represent an alternative to support VC merging. However, AAL3/4 is 
not used for this purpose because its MID is intended for multiplexing different 
kinds of traffic from the same source (not from different sources) on the same VC. 
Moreover, since the AAL3/4 MID is shorter (10 bits) than the AAL5+ one, it is not 
suited to a random assignment, because the probability of MID collisions would be 
significantly higher. 

Even though the efficiency loss introduced by AAL5+ with respect to AAL5 is 
not a major issue, the 46 byte payload of the SAR-PDU is not large enough to 
allow a TCP control message (e.g., an acknowledgment message) to be fully 
contained in a single cell³. This halves the efficiency in the transmission of TCP 
control messages (e.g., ACK segments) since two ATM cells must be transferred 
instead of one. Actually, the default encapsulation method of IP packets over ATM 
networks requires an LLC/SNAP (Logical Link Control/SubNetwork Attachment 
Point) header (8 bytes) to be put in front of each IP packet in order to allow for 
multiplexing of different upper layer protocols [RFC 1577] on the same VC. In this 
case TCP control messages do not fit into a single cell payload anyway, even with 
AAL5. Moreover, if IPv6 packets [RFC 1883] are transmitted using AAL5, TCP 
control messages do not fit in a single ATM cell since the IPv6 header is 40 bytes 
by itself. 
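
The cell counts discussed above follow from simple arithmetic, sketched below. The header and trailer sizes are those given in the text; the helper `cells_needed` and the assumption that AAL5+ keeps the 8-byte AAL5 CS-PDU trailer are introduced only for this example.

```python
CELL_SIZE = 53     # bytes, header included
AAL5_PAYLOAD = 48  # SAR payload bytes per cell with AAL5
AAL5P_PAYLOAD = 46 # SAR payload bytes per cell with AAL5+ (2 bytes of MID)
TRAILER = 8        # AAL5 CS-PDU trailer, assumed unchanged in AAL5+
LLC_SNAP = 8
TCP_HEADER = 20
IPV4_HEADER = 20
IPV6_HEADER = 40


def cells_needed(packet_bytes, sar_payload):
    """Cells required to carry a packet plus the CS-PDU trailer."""
    total = packet_bytes + TRAILER
    return (total + sar_payload - 1) // sar_payload  # integer ceiling


ack_v4 = TCP_HEADER + IPV4_HEADER
ack_v6 = TCP_HEADER + IPV6_HEADER

print("efficiency AAL5 :", round(100 * AAL5_PAYLOAD / CELL_SIZE, 2))
print("efficiency AAL5+:", round(100 * AAL5P_PAYLOAD / CELL_SIZE, 2))
print("IPv4 ACK         :", cells_needed(ack_v4, AAL5_PAYLOAD),
      cells_needed(ack_v4, AAL5P_PAYLOAD))
print("IPv4 ACK + LLC   :", cells_needed(ack_v4 + LLC_SNAP, AAL5_PAYLOAD),
      cells_needed(ack_v4 + LLC_SNAP, AAL5P_PAYLOAD))
print("IPv6 ACK         :", cells_needed(ack_v6, AAL5_PAYLOAD),
      cells_needed(ack_v6, AAL5P_PAYLOAD))
```

Only the bare IPv4 acknowledgment distinguishes the two adaptation layers (one cell with AAL5, two with AAL5+); with LLC/SNAP encapsulation or IPv6 both need two cells.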

2.10 CLIMAX 

The CLIMAX (CelL-Interleaved Merged ATM connexions) approach [CLIMAX], 
analogously to AAL5+, proposes the exploitation of randomly chosen 16 bit 
Message IDentifiers (MIDs) to allow cell interleaving at VC merging points. 
CLIMAX encompasses two possible implementations which basically differ in the 
way the MID is carried into cells. 

AAL5+ Based CLIMAX inserts the MID in the first two bytes of the cell payload 
using the same format proposed in [AAL5+]. 

VP Switched CLIMAX inserts the MID in the VCI field of the cell header. This 
requires a software modification at the transmitting side of end systems in order to 
randomly choose a VCI value for the cells resulting from the segmentation of the 
same packet. VP Switched CLIMAX does not require any modification to the 
hardware of both network nodes and end systems (or edge devices). ATM switches 
perform VP switching on CLIMAX merged connections and VC switching on 
other VCs. This solution has a clear scalability limit due to the small size of 
the VPI field. If switches must support both traditional ATM VCs and CLIMAX 
merged VPs, a bit in the VPI must be used to differentiate between the two kinds 
of connections, and the space of the merged VP identifiers is consequently reduced. 
VP Switched CLIMAX is completely transparent to the destination, which will 
distinguish cells belonging to different packets in the same way AAL5 usually 
does. In fact, cells belonging to different packets arrive on different VCs, unless 
two different sources transmitting on the same VP have chosen the same MID to 
identify their packets (i.e., a MID collision takes place). 

³ The TCP header (20 bytes), the IP header (20 bytes) and the AAL5 CS-PDU trailer 
(8 bytes) fit exactly in the ATM cell payload. 

Like AAL5+ (see Section 2.9), CLIMAX does not try to avoid MID 
collisions since, in reasonable operating conditions, the MID collision probability 
is low and the consequent loss is acceptable [CLIMAX-TR], especially if EPD is 
implemented in intermediate switches as briefly discussed below. 

The SAR sublayer in the receiver (of AAL5+ or AAL5, depending on the specific 
CLIMAX implementation) gathers payloads of cells with different MID values in 
different packet reassembly buffers. When an EOM cell is received, the SAR 
sublayer delivers the corresponding packet to the upper layer and releases the 
reassembly buffer associated with it. The memory required by the buffers 
concurrently used by the receiving entity to reassemble messages can limit the 
scalability of the approach. 




Event  Description                    Action

E1     Arrival of an EOP cell with    - Deliver the CPCS-PDU contained in the SAR
       MID m ∉ CURR_MID_SET             payload to the CPCS layer

E2     Arrival of a cell with         - Allocate a reassembly buffer for MID m
       MID m ∉ CURR_MID_SET           - Add MID m to CURR_MID_SET
                                      - Start timer for MID m

E3     Arrival of a cell with         - Reset timer for MID m
       MID m ∈ CURR_MID_SET           - Put the SAR payload into the reassembly
                                        buffer associated with MID m

E4     Arrival of an EOP cell with    - Deliver the reassembled CPCS-PDU to the
       MID m ∈ CURR_MID_SET             CPCS layer
                                      - Release the reassembly buffer associated
                                        with MID m
                                      - Remove MID m from CURR_MID_SET

E5     Timer for MID m expires        - Discard the incomplete CPCS-PDU
                                      - Release the reassembly buffer associated
                                        with MID m
                                      - Remove MID m from CURR_MID_SET

Figure 5 - State-transition diagram of a receiving entity SAR sublayer in CLIMAX 

If an EOM cell gets lost, the reassembly buffer associated with one of the packets 
being reassembled will never be released. If switches implement EPD or a 
similar packet discarding mechanism, this phenomenon is rare. Since these 
techniques try to discard only entire packets instead of cells belonging to different 
packets, they limit the number of incomplete packets delivered to the destination, 
thus lowering the number of unreleased buffers. However, packet discarding 
techniques like EPD are not yet widely deployed. 

The loss of EOM cells increases the probability of having MID collisions since 
each unreleased buffer is equivalent to keeping a MID in use until it generates a 
collision. CLIMAX strictly limits extra buffer allocation by exploiting a buffer 
release timer. This reduces MID collision frequency and memory requirements in 
edge devices. Figure 5 shows the state-transition diagram of a SAR sublayer 
receiving entity exploiting the buffer release timer. Each state represents the 
number of packets currently being reassembled or, equivalently, the number of 
reassembly buffers simultaneously allocated. A state change occurs when either a 
cell with a new MID or an EOM cell is received, or the timer associated with a 
particular MID expires. CURR_MID_SET is the set of the MIDs associated with 
packets being reassembled. 
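
The receiving entity of Figure 5 can be sketched as follows. This is an illustrative model rather than the CLIMAX implementation: the class `ClimaxReceiver`, the explicit `now` argument used instead of real timers, and the byte-string payloads are assumptions; the comments refer to the events of Figure 5.

```python
class ClimaxReceiver:
    """Hypothetical SAR receiving entity with a buffer release timer."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.buffers = {}    # CURR_MID_SET with its reassembly buffers
        self.deadlines = {}  # MID -> time at which its timer expires
        self.delivered = []  # packets handed to the upper (CPCS) layer

    def on_cell(self, mid, payload, eom, now):
        self.expire(now)
        if mid not in self.buffers:
            if eom:
                # Single-cell packet with an unknown MID: deliver directly.
                self.delivered.append(payload)
                return
            # New MID: allocate a reassembly buffer.
            self.buffers[mid] = []
        if eom:
            # Last cell of a known MID: deliver and release the buffer.
            cells = self.buffers.pop(mid)
            del self.deadlines[mid]
            self.delivered.append(b"".join(cells) + payload)
        else:
            # Store the payload and (re)start the timer for this MID.
            self.buffers[mid].append(payload)
            self.deadlines[mid] = now + self.timeout

    def expire(self, now):
        """Timer expiry: discard incomplete packets and release their buffers."""
        for mid in [m for m, t in self.deadlines.items() if t <= now]:
            del self.buffers[mid]
            del self.deadlines[mid]


rx = ClimaxReceiver(timeout=5)
rx.on_cell(1, b"A", False, now=0)
rx.on_cell(2, b"X", False, now=1)   # interleaved with the packet using MID 1
rx.on_cell(1, b"B", True, now=2)
rx.on_cell(2, b"Y", True, now=3)
rx.on_cell(3, b"Q", False, now=4)   # its EOM cell is lost
rx.on_cell(4, b"Z", True, now=21)   # single-cell packet; MID 3 has timed out
print(rx.delivered, sorted(rx.buffers))
```

The two interleaved packets are reassembled correctly, and the buffer left open by the lost EOM cell is released when its timer expires instead of being held indefinitely.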

The choice of the duration of the buffer release time-out is critical, as shown by 
preliminary results of ongoing simulation work [CLIMAX-TR]. An overly 
conservative (too long) timer might result in the need for a large amount of buffer 
memory and an increased MID collision frequency. This is the reason why the 
traditional AAL5 timeout mechanism, which is too loose, is not considered 
effective. At the opposite end, if the buffer release timer is set too short (i.e., 
shorter than the maximum cell delay variation experienced in the network), 
partially reassembled messages may be discarded due to even a single cell which 
has experienced the maximum delay. 

Since buffering requirements at the destinations can affect the scalability of the 
approach, it is worth comparing CLIMAX with other approaches from this point of 
view. Destinations implementing CLIMAX allocate a buffer for each message 
being received on a multipoint-to-point VC. Assuming no loss of EOM cells, the 
maximum number of allocated buffers equals the number of sources transmitting 
concurrently on the merged VC. In a real scenario, EOM cells can get lost and 
reassembly buffers can be left open, but the exploitation of an effective buffer 
release timer can keep the number of buffers in use very close to the lossless case. 
Notice that when multipoint-to-point communications or multi-layer forwarding 
are performed without exploiting VC merging (i.e., group communications are 
implemented through a mesh of point-to-multipoint VCs and point-to-point VCs are 
used with multi-layer forwarding schemes) the total buffering capacity required in 
each receiver equals the number of sources, i.e., the upper bound for CLIMAX. This 
is because in each receiving node a different AAL5 entity must be instantiated to 
terminate each VC, with the consequent allocation of a reassembly buffer. 
Alternatively, when VC merging is performed by avoiding cell interleaving in 
merging points (e.g., as in the MPLS approach described in Section 2.2), the 
buffer space used by CLIMAX receiving entities is needed in the switches instead. 

2.11 Comparison 

In Table 1 a comparison among the three classes of approaches discussed in 
Section 2 is outlined. The comparison is based on issues relevant to the production 
and deployment of these schemes (e.g., need for hardware modification). 

Hardware changes are needed in either edge or core devices, but most of the 
approaches do not require both of them. The approaches based on AAL5 
modifications require hardware changes in edge devices, while the others usually 
impact the core of networks. Notice that in wide area networks the ratio between 
core and edge devices is usually 1:20, making it simpler and preferable to change 
the former. AAL5 compatibility is obviously not granted by approaches based on 
AAL5 modifications, while it is generally preserved in the others. 



Table 1 - Comparison of the VC merging approaches 



Category            No cell-interleave   VP switching              Modified AAL5

Examples            MPLS, SEAM, CRAM     Impr. VP switching,       DILS (opt. 3), AAL5+
                                         DIDA, DILS (opt. 1 & 2),
                                         Sink Tree, CLIMAX

Hardware changes    No                   No                        Yes
in edge devices

Hardware changes    Yes                  No (if VPI/VCI            No (if EPD is not
in the switches                          partitioning is not       needed)
                                         changed)

AAL5 compatibility  Yes                  Yes                       No

Label space for     VPI/VCI (28 bits)    VPI (12 bits)             VPI/VCI (28 bits)
destination

EPD compatibility   Yes (changes         Yes (no changes)          Yes (changes needed)
                    needed)

Buffering required  High                 Low                       Low

Latency             High                 Low                       Low

Switching           Pseudo-packet-       VP level cell-switching   Pure cell-switching
                    switching                                      (not for CRAM)

QoS capabilities    Low                  Medium (VP based)         High (connection
                                                                   based), lower for CRAM

The label space per destination is an indicator of the scalability of the approach 
because, if limited, it can reduce the maximum number of edge devices that could 
be connected to the network. 

EPD compatibility indicates if any changes are needed in order to support packet 
discarding techniques, like EPD. Approaches based on avoiding cell interleaving can 
support EPD, but they will need changes in ATM switch hardware. This is not an 
added limitation, since hardware changes are needed anyway in this case. 
Approaches based on usage of a packet identifier - either carried in the VCI field 
(VP switching) or in cells (AAL5 modifications) - could easily interoperate with 
current implementations of EPD, but it would be more effective to base packet 
discarding on the identifier used to support VC merging. 

Buffering, latency and switching method are considered significant performance 
indicators. The first one impacts the cost and complexity of switches while the last 
two affect the suitability of the approach for controlling delay and jitter. 
Approaches based on VP switching and those requiring modification of AAL5 
offer better performance, while approaches based on the avoidance of 
cell-interleaving could have some limitations, especially when handling traffic 
other than best-effort. 

The Quality of Service (QoS) capability row expresses the suitability of the 
category of VC merging approach for guaranteeing QoS. Of course, the more cell 
switching and its properties are preserved, the higher the suitability for providing 
QoS guarantees. 

3. CONCLUSIONS 

A network using the standard IP over ATM protocol stack in intermediate and end 
systems does not allow Virtual Connections (VCs) to be merged. This feature is 
essential to allow for transmission of packets on multipoint-to-point VCs in order 
to solve scalability problems in either multi-layer forwarding or group multicast 
communications. This paper presents a survey of the most common approaches 
proposed so far to solve the ATM VC merging problem. The approaches are 
grouped into three categories which are compared according to issues relevant to 
the production and deployment of the required equipment. 

Currently, the mainstream approach to solve the VC merging problem in the 
context of the IETF’s Multiprotocol Label Switching (MPLS) working group is 
based on avoiding cell interleaving in merging points. (Modified) ATM switches 
buffer all the cells of a packet before starting to forward them; this represents a 
step away from cell switching towards packet switching. 

We consider CLIMAX a very promising approach due to its properties. It is easy 
to implement and operate, and since it implements traditional cell switching, it is 
suitable for the provision of Quality of Service (QoS) guarantees. Two CLIMAX 
implementations are possible: one based on usage of VP switching, the other based 
on a modification of AAL5 named AAL5+. The latter has higher scalability, but 
requires hardware changes in edge devices and thus, due to the large number of 
such devices, it is not an attractive solution for immediate deployment. 

We envision a migration towards the massive adoption of VC merging in ATM
networks, where the most suitable short-term solutions are the VP-switched
CLIMAX implementation (in small networks) and the MPLS approach (in large
networks with a high ratio between the number of edge and core devices and with
no QoS requirements). For the long term, the solution that will best combine
scalability and cell switching performance is CLIMAX based on AAL5+.

Abstract

This paper addresses the problem of supporting communications in parallel 
computing applications over ATM networks. We propose a mechanism specif- 
ically conceived for optimizing the cost-performance tradeoff in fairly long 
parallel executions. The proposed mechanism relies on a modified version of 
the loss recovery procedure of SSCOP, which is enhanced by means of a more 
intensive exploitation of ATM service categories in order to reduce the occur- 
rence of cell loss. For this purpose, we make use of both the UBR and ABR 
service categories, with ABR being only introduced in the periods of high 
latency. These periods are determined by periodically monitoring the
experienced latency. This approach can achieve latency equivalent to that of
the plain ABR service while using that service for only 30%-70% of the
parallel computing traffic, depending on the load of the network and the
characteristics of the application.

1 INTRODUCTION

The availability of a high-speed network with the flexibility of ATM (Asyn- 
chronous Transfer Mode), together with the current evolution of microproces- 
sor technology, is enabling the convergence of communications and computing. 
In addition, as defined by the ITU (International Telecommunication Union), 
ATM is the technology that will integrate the whole diversity of network-based 
services [1]. The combination of these three factors — high-speed networks, 
microprocessor technology, and integration — is facilitating the development 
of new applications requiring intensive communications. One of these
applications is distributed parallel computing, where a number of
workstations connected to an ATM network can act as nodes of a parallel
computing platform. Such environments cannot reach the performance achieved
by more expensive, dedicated platforms such as multiprocessors, although 
they can be a sufficient replacement for many applications [2]. The main bot- 
tleneck in network-based parallel computing is experienced in the network 
itself, and is caused by the delays produced by the protocol processing, the 
interface with the network, and the processing within the network [3]. These 
issues occur despite the bandwidth enabled by ATM. 

In an integrated environment, communications in parallel computing
applications are not limited to LAN environments (which can be adequately
supported by Myrinet or Gigabit Ethernet) but can be extended
beyond the local area. The adoption of ATM allows parallel computing
applications to take advantage of an existing network, thus avoiding the un- 
derutilization of duplicated resources that would appear with the use of dedi- 
cated networks such as Myrinet. Thus, organizations whose parallel computing 
needs are not very intensive will be able to achieve satisfactory performance 
with a more efficient exploitation of resources. In this context, parallel
computing environments are subject to a number of conditions in order to
achieve sufficient performance with cost-effectiveness. The first condition we
assume is that parallel computing applications will share the ATM network 
with traditional networking applications. Thus, the network architecture will 
require the presence of mechanisms enabling the support of parallel comput- 
ing applications that can coexist with equivalent mechanisms for traditional 
networking applications. In order to preclude the increase in complexity that 
would arise with the enhancement of ATM with application-specific
mechanisms, we consider that the adaptation of parallel computing to ATM should be
done with mechanisms implemented on top of ATM. Thus, ATM will solely 
support those services defined in the standards by ITU-T (Telecommunication 
Standardization Sector of ITU) and the ATM Forum [4]. 

Many of the applications to be integrated in ATM networks have strong 
bandwidth and/or delay requirements, as they manage continuous data streams. 
In these applications, network mechanisms should maximize the network ca- 
pacity, measured by throughput. In parallel computing applications, however, 
communications involve the exchange of relatively short pieces of data along 
a relatively long execution period, so the minimization of communication
time is not tightly related to network capacity. Thus, communications
in parallel computing applications approximate the request-response model,
since each task sends data to other tasks and expects other data from them. In
this model, per-message overheads set a limit on the achievable performance 
and therefore, as discussed in [5], latency is a measure that gives a clearer 
idea about communication performance in parallel computing applications. 
We consider the latency measure as embedding all per-message communication
costs, which include, in addition to the costs of overheads and the delays
from buffering and scheduling, the possible need to recover from cell loss
that results from sharing the network with other networking applications.

In this paper, we propose a mechanism that enables low latency operation 
for communications in parallel computing environments over ATM networks 
integrating different networking applications. In particular, we concentrate on 
networks spanning outside the local area, where the impact of the applications 
sharing the network with parallel computing can be more significant. Thus, 
our mechanism will provide for a strategy to minimize latency degradation 
caused by the presence of background traffic, which is based on periodically 
monitoring the latency experienced by communications in real time, in order 
to achieve cost-effective performance. The rest of the paper is organized as fol- 
lows: Section 2 presents the general characteristics of the target environments 
for our mechanism. In Section 3, the particular features of the mechanism 
proposed in this work are extensively discussed, and their performance is 
evaluated in Section 4. Finally, Section 5 concludes the paper.



2 PARALLEL COMPUTING IN AN INTEGRATED 
ENVIRONMENT 

Most environments that support parallel computing are designed to operate
on multiprocessors or high-speed LANs. Both of these environments can be 
very expensive when a large number of nodes is required by applications, so 
they are basically adequate for intensive use of parallel computing facilities. 
When the need of parallel computing support is not so intensive, it is better to 
allow for parallel computing environments to extend beyond the local area to 
provide appropriate scalability to parallel computing applications. In this case, 
LAN technologies like Gigabit Ethernet are not applicable, so the role of ATM 
as an integrating technology is more clear. Another important issue outside 
the local area is the greater influence of network load as more applications 
are then presumed to share the ATM network. In this paper, we assume 
that the scenario for distributed parallel computing over ATM will be based 
on a virtual network composed of the endpoint hosts supporting the tasks
of parallel computing applications, as well as other nodes implementing the 
procedures providing addressing, connection management, and other signaling 
functions. In this model, the endpoint hosts support the actual data transfer 
operations, while the rest of the nodes in the virtual network are in charge of 
establishing the necessary connections between the endpoint hosts in order to 
build the topology required for each particular parallel computing application. 
The signaling procedures operate before and after the actual execution, and 
are out of the scope of this paper. 

Figure 1 displays the architecture of the endpoint part of the ATM-based 
platform. Data transfer mechanisms for parallel computing are supported in 
a specific architecture to be integrated with the specific architectures of
traditional networking applications. A proposal for the architecture of the
parallel computing service is discussed in [6]. Three levels are considered:
(1) Application level, which manages the specific requirements of parallel
computing applications; (2) Network level, containing the functions provided
by a particular network technology, such as ATM in the present work; and
(3) Convergence level, which includes those functions that are required for
an adequate support of parallel computing applications and are not provided
by the network level as defined in the respective standards.

Figure 1 Integration of services over ATM.



Communications in parallel computing applications are considered to be
based on sequences of elementary data types (integer, float, double, etc.),
called PC-PDUs after “Parallel Computing Protocol Data Units”, which are
the minimum data structures understood by parallel tasks in a logical sense.
Larger structures (arrays, structs, etc.) can be broken into these elementary
PC-PDUs. Bandwidth needs are not very high on average, because
communications among parallel tasks do not occur continuously: an
arbitrary period of time can separate two consecutive messages.
Nevertheless, in the particular instants when a message is submitted, very 
low latency is required in the network in order to minimize the impact of 
communications on performance. As the mechanisms in the convergence level 
have to satisfy all these requirements with a full guarantee of data delivery, 
this paper adopts the specific ATM Adaptation Layer (AAL) proposed in [6], 
which is based on a modified version of SSCOP (Service Specific Connection 
Oriented Protocol). SSCOP is a protocol defined by ITU-T in the Q.2110 
recommendation [7] for supporting a number of services requiring reliability on 
top of ATM. This specific AAL replaces AAL5 and improves performance by 
avoiding the retransmission of more cells than those effectively lost. With this 
AAL, applications are less sensitive to the network load induced by the rest of
the applications sharing the ATM network, and communications achieve better
latency performance. This AAL, however, does not rely on any particular
ATM service category from those specified by the ATM Forum [4]. In order
to optimize performance, we propose a modified version of the AAL that
takes advantage of the features included with these service categories.

The fact that parallel computing applications exchange relatively short
messages over relatively long execution periods makes guaranteed
communication services, such as CBR (Constant Bit Rate), unsuitable.
Instead, best-effort services such as UBR (Unspecified Bit Rate) and ABR
(Available Bit Rate) are the most appropriate service categories to support
communications in ATM-based parallel computing environments, since their
cost will depend mostly on the effective consumption of bandwidth, as opposed
to other service categories where the length of the connection period will be
a more important issue. UBR is the least expensive service category, but its
latency can be excessively high due to the cell loss occurring as the network
load increases, while ABR is more expensive but faster, as its built-in flow
control mechanism achieves lower latency thanks to the fewer retransmissions
needed.

In addition, because of the long execution periods, a number of high activity 
and low activity periods may alternate in the network, as a result of the 
applications sharing the network. In the periods with low network traffic, the 
performance of UBR may be sufficient and, as a result, the higher cost of 
the ABR service category would not be amortized. Thus, to achieve
cost-effective performance, data should be conveyed through UBR while
the latency experienced in the network is low and, when latency through UBR is
excessively high, data transfer should be moved to an ABR-based connection.
A procedure to monitor latency is therefore needed in order to determine
when to activate the ABR service category.

3 ENHANCED PARALLEL COMPUTING AAL 

We focus on the data transfers occurring during execution time by assum- 
ing that the necessary connections have been established prior to the execu- 
tion. As mentioned above, our proposal is conceived to provide cost-effective 
performance by adapting to the latency experienced by parallel computing 
communications. In the following we detail the operation of our mechanism, 
starting with an overview and continuing with a detailed description. 



3.1 Architecture 

Figure 2 depicts the architecture of our mechanism when the extensions to 
the Parallel Computing AAL are applied. In each communicating peer, the 
functionality is contained in two concurrent processes: (1) the Latency
Monitoring Engine (LME), which monitors the latency in the network in order to
determine the periods in which significantly high latency is experienced, and
(2) the Effective Communication Engine (ECE), which performs the actual
data transfers according to the information supplied by the LME. The
communications between the engines are served by four connections between each
pair of communicating endpoints:



• A UBR-based connection with an unlimited peak rate, used by the ECE 
to transfer data when latency is low. We refer to this connection as the 
ordinary connection. 

• An ABR-based connection with a limited peak rate and a minimum bit rate 
set to zero, used by the ECE to transfer data when the LME indicates that 
latency is high. This connection is referred to as the backup connection.

• A UBR-based connection like the ordinary connection, which is used by the
LME to monitor latency. In practice the same UBR connection is used for
both purposes.

• A VBR (Variable Bit Rate) connection with a guaranteed low peak rate, in
order to support a fast and reliable delivery of feedback information in the LME.



The adoption of a VBR service category — whose cost is significantly higher 
than ABR — could compromise the objective for cost-effective performance of 
our mechanism. However, later in the paper we will observe that the adoption 
of a VBR-based connection does not significantly impact on performance of 
parallel computing applications. 



[Figure 2: sender and receiver peers linked by four connections: ordinary connection (UBR), backup connection (ABR), monitoring connection (UBR) and feedback connection (VBR).]

Figure 2 Mechanisms for extending the Parallel Computing AAL.

3.2 The Effective Communication Engine (ECE)

The Effective Communication Engine (ECE) consists of an extension of the
Parallel Computing AAL described in [6] that exploits the information
supplied by the LME in order to achieve low-latency communications.
The mechanism discussed in [6] is based on a modification of the selective
retransmission procedure of SSCOP that limits the length of the frames
to one cell. Thus, unlike standard SSCOP,
the number of retransmitted cells corresponds exactly to the lost cells and, as
a consequence, applications become less sensitive to network load. This mod- 
ification is possible thanks to the short length of PC-PDUs — corresponding 
to elementary data types such as integer, float, etc., as noted above. In partic- 
ular, each cell encapsulates as many complete PC-PDUs as possible, so that 
the data can be integrated with computation as soon as received. In order to 
avoid the unnecessary overheads involved with the payload length and the 32- 
bit checksum of AAL5, the mechanism directly replaces AAL5, so it actually 
operates as a specific AAL. Figure 3 shows the structure of a cell generated 
by this specific AAL. 



[Figure 3: structure of the 48-byte cell payload, including the data units and the pad; field sizes are given in bytes.]

Figure 3 Encapsulation scheme of the specific AAL for parallel computing.



The cell structure shown in Figure 3 includes a significant amount of over- 
head. Although this overhead obviously leads to throughput degradations, 
we are more interested in optimizing message latency because of the small 
significance of throughput on the performance of communications in parallel 
computing. The overhead includes the following fields: 



• AAL-related fields, which include only the CRC field. This 16-bit CRC
replaces the 32-bit CRC provided in AAL5, which is not suited to
a 1-cell AAL PDU.

• SSCOP-related fields, which include the Sequence Number and PDU Type
fields. These fields are directly inherited from standard SSCOP, but the
Sequence Number allows for a larger number space because restricting PDUs
to one cell will presumably lead to a larger number of PDUs.

• Message-passing library fields, represented by the Tag, Offset First, and
Offset Last fields. They are set to enable compatibility with the PVM
(Parallel Virtual Machine) message-passing library [8], which is used by the
parallel programs we have tested. Other message-passing libraries would
possibly require different fields.

• Connection management fields, which include a Connection Identifier that
supports an additional addressing level together with the VCI/VPI fields,
in order to facilitate the implementation of a virtual network supporting
parallel computing communications.
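
As a rough illustration of the encapsulation rule described earlier (each cell carries as many complete PC-PDUs as fit), the following Python sketch groups PC-PDUs into 48-byte cell payloads. The 10-byte overhead and the names `pack_pc_pdus`, `padding`, `CELL_PAYLOAD` and `OVERHEAD` are assumptions made only for this example; the real field widths are those of Figure 3.

```python
CELL_PAYLOAD = 48   # bytes available in an ATM cell payload
OVERHEAD = 10       # hypothetical total size of the overhead fields, in bytes


def pack_pc_pdus(pdus, overhead=OVERHEAD, payload=CELL_PAYLOAD):
    """Group PC-PDUs (bytes objects) into cells without splitting any PDU.

    Returns a list of cells; each cell is the list of PDUs it carries.
    """
    room = payload - overhead
    cells, current, used = [], [], 0
    for pdu in pdus:
        if len(pdu) > room:
            raise ValueError("a PC-PDU must fit in a single cell")
        if used + len(pdu) > room:      # next PDU does not fit: start a new cell
            cells.append(current)
            current, used = [], 0
        current.append(pdu)
        used += len(pdu)
    if current:
        cells.append(current)
    return cells


def padding(cell, overhead=OVERHEAD, payload=CELL_PAYLOAD):
    """Number of pad bytes needed to fill the cell payload."""
    return payload - overhead - sum(len(p) for p in cell)
```

Under these assumed sizes, ten 8-byte doubles would travel in three cells carrying four, four and two PC-PDUs, with the remaining bytes of each payload filled by the pad.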

The ECE enhances the Parallel Computing AAL described in [6] by
considering two operation modes: low-latency mode and high-latency mode. The
extension to the AAL applies essentially to the high-latency mode, which is
activated when the LME detects a significant growth in the latency experienced
in the ordinary connection. In this case, the backup connection is enabled,
so that the cells transmitted on the ordinary connection are switched to the
backup connection in order to minimize the impact of cell loss on performance.
Figure 4 outlines the operation of the ECE.

[Figure 4: two panels, low-latency mode and high-latency mode, each showing the ordinary connection (UBR) and the backup connection (ABR).]

Figure 4 Operation of ECE’s latency modes.



(a) Low-latency mode 

In this mode, the operation of the ECE reduces to the mechanism of the 
Parallel Computing AAL just outlined. The transfers of data take place over 
the ordinary connection, so a UBR service is used. Latency monitoring by the 
LME takes place also over this ordinary connection using a UBR service. 

When the receiver part of the LME detects that a monitoring cell has been
lost, or when the ECE itself considers that the measured latency is high (i.e.,
it exceeds a threshold Tm), the ECE activates the high-latency mode. For this
purpose, it issues a new control frame, called LSTAT, which is equivalent to
a STAT frame but also contains a time stamp corresponding to the instant
when the offending monitoring cell was issued by the sender. LSTAT frames
are sent through the same VBR service as USTAT and STAT frames.

(b) High-latency mode 

When the sender (in low-latency mode) receives an LSTAT frame, it switches
to the high-latency mode and triggers the retransmission of pending data, just
as it would for a STAT frame. Then, all cells are issued through the backup
connection only. As in low-latency mode, USTAT frames are generated when cell
loss is detected. When the sender receives STAT and USTAT frames, the
retransmission is conveyed by the backup connection only. Thus, the operation
of the ECE in high-latency mode is similar to the operation in low-latency
mode, except for the fact that the backup connection (over an ABR service)
is used instead of the ordinary connection (over a UBR service).

When the latency monitored by the LME falls below a threshold T'm, the
low-latency mode is again activated by issuing an LSTAT frame to the sender,
including again the information about the status of received cells as contained
in STAT frames. The sender then retransmits the cells through the ordinary
connection only. The threshold T'm should be lower than the threshold Tm
in order to avoid continuous switching between the two modes. In all cases,
latency is monitored by the LME over the ordinary connection only (that is,
over a UBR service) regardless of the operation mode.
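
The two-threshold switching rule described above behaves like a hysteresis loop, and a minimal Python sketch of it is given here. The threshold values are the ones listed in Table 2; the class `ModeSelector`, its method names and the idea of feeding it one LME report at a time are assumptions of this sketch, not part of the mechanism as specified.

```python
T_HIGH = 0.0001    # upper ECE latency threshold from Table 2, in seconds
T_LOW = 0.00009    # lower ECE latency threshold from Table 2, in seconds

LOW_LATENCY, HIGH_LATENCY = "low-latency", "high-latency"


class ModeSelector:
    """Receiver-side choice of ECE mode with two thresholds (hysteresis)."""

    def __init__(self, t_high=T_HIGH, t_low=T_LOW):
        if not t_low < t_high:
            raise ValueError("the lower threshold must be below the upper one")
        self.t_high, self.t_low = t_high, t_low
        self.mode = LOW_LATENCY

    def update(self, latency=None, cell_lost=False):
        """Feed one LME report; return True if an LSTAT frame should be sent."""
        previous = self.mode
        if self.mode == LOW_LATENCY:
            # A lost monitoring cell or a latency above the upper threshold
            # activates the high-latency mode.
            if cell_lost or (latency is not None and latency > self.t_high):
                self.mode = HIGH_LATENCY
        elif not cell_lost and latency is not None and latency < self.t_low:
            # Only a latency below the lower threshold switches back.
            self.mode = LOW_LATENCY
        return self.mode != previous

    def connection(self):
        """Connection the sender should use for data in the current mode."""
        return "backup (ABR)" if self.mode == HIGH_LATENCY else "ordinary (UBR)"
```

A latency sample that falls between the two thresholds leaves the current mode unchanged, which is what prevents the sender from oscillating between the ordinary and the backup connection.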



3.3 The Latency Monitoring Engine (LME) 

The goal of the LME is to provide an estimation of the latency experienced 
in the ordinary connection. For its implementation we have considered three 
decisions: 

• Averaging vs. instantaneous monitoring. Latency can be monitored by
computing the average latency over a period of time. This is well suited for
applications dealing with large chunks of data, like video and file transfers,
but as this procedure has a slow response time, it is not suitable for
applications generating more bursty traffic patterns. Therefore, we believe
that instantaneous monitoring is a more adequate approach for parallel
computing applications.

• Asynchronous vs. periodic activation. Latency can be monitored either
before a burst of messages or in a periodic fashion. The former case forces the
ECE to defer the transmission until the latency is monitored, so it involves
a significant amount of latency. In contrast, the latter approach enables the
ECE to avoid this delay. For this reason, we believe that a periodic LME is
more adequate, despite the extra bandwidth required to support periodic
monitoring.

• The monitoring mechanism. We can consider the following options: (1) using
network-level information; (2) computing the Round Trip Time (RTT);
and (3) synchronizing peers and using time-stamped information. The first
approach requires the use of an ABR-like network-level mechanism providing
accessible feedback information, which is not currently standardized
within ATM. In the second case, the computed time depends on the
latencies of both the monitored connection and the returning path, which
are not necessarily equivalent. In the third approach, the experienced
latency is monitored by the receiver LME peer, so there is no influence of the
returning path on the computed value. As a result, we adopt the third
approach as we find that it better suits the requirements of parallel computing
communications.

The operation of the adopted approach for the LME is as follows: the sender 
periodically submits a cell containing a time-stamp. When the receiver gets 
this cell, it compares its time-stamp to the time the receiver expected to get 
the cell. The measured latency corresponds to the difference between both 
times, and then the measurement is passed to the ECE so that it takes the 
appropriate action, which in the implementation of the ECE discussed above 
consists of replying to the sender if the monitored latency exceeds a threshold. 
As the time-stamped cells might be lost, when a certain amount of time Tl has
elapsed since the expected time, the receiver warns the ECE of that
circumstance, meaning that a monitoring cell has possibly been lost. Figure 5
illustrates the operation with an example. As an enhancement to this basic
procedure, the latency of the cells issued by the ECE through the ordinary
connection is also monitored in order to reduce the response time of the
whole mechanism.
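
The receiver side of the procedure above can be sketched in a few lines of Python. This is one possible reading under stated assumptions: clocks are taken as perfectly synchronized, the latency is computed as arrival time minus the sender's time stamp, and the "expected time" of the next monitoring cell is approximated by the previous time stamp plus the monitoring interval. The class `LmeReceiver` and its method names are hypothetical; the interval and Tl values come from Table 2.

```python
MONITOR_INTERVAL = 0.1   # LME monitoring interval from Table 2, in seconds
T_LOSS = 0.1             # LME loss threshold Tl from Table 2, in seconds


class LmeReceiver:
    """Receiver part of a time-stamp based latency monitor (sketch)."""

    def __init__(self, first_expected, interval=MONITOR_INTERVAL, t_loss=T_LOSS):
        self.expected = first_expected   # time at which the next cell is due
        self.interval = interval
        self.t_loss = t_loss

    def on_cell(self, timestamp, arrival_time):
        """Handle a monitoring cell; return its one-way latency in seconds."""
        latency = arrival_time - timestamp
        # The next periodic cell is assumed to be issued one interval later.
        self.expected = timestamp + self.interval
        return latency

    def possibly_lost(self, now):
        """True once more than Tl has elapsed past the expected time."""
        return now - self.expected > self.t_loss
```

In such a sketch the value returned by `on_cell` would be handed to the ECE, and a periodic timer would call `possibly_lost` to raise the loss warning.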




Figure 5 Operation example for the time-stamped LME. 

It is important to note that both sender and receiver must be synchronized
with each other in order for the measurements to be meaningful. For this
purpose, one of the peers has to report its current time to the other
with a certain periodicity. Thus, we consider two tasks included in the
time-stamped LME: (1) Monitoring task, which deals with both the periodic and
the ECE-originated time-stamped cells; and (2) Resynchronizing task, which
guarantees that time values are consistent for both communicating peers. We
can make use of the different service categories provided by ATM in order
to implement these tasks. The Monitoring Task is carried out over the same
connection as the ordinary data transfers in the ECE, so it is supported by a
UBR service. The Resynchronizing Task also requires high priority and, as it
is periodic, a CBR service is more adequate. Note that the peak rates for the
CBR service should be kept low in order to avoid the allocation of an excessive
amount of resources. The concrete value of the period depends mostly on the
characteristics of the system clocks in both communicating peers, since the
more the clocks diverge, the more frequently the Resynchronizing Task
should be activated. In the experiments presented below, we assume that both
endpoints are perfectly synchronized and, therefore, no resynchronizing task is
considered.
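
The dependence of the resynchronization period on clock quality can be made concrete with a simple linear-drift model. This model is an assumption of the sketch below, not part of the mechanism as described: if the two clocks diverge at a relative rate of at most `drift_ppm` parts per million and the latency measurement tolerates an offset of at most `max_offset` seconds, the peers need to resynchronize at least every `max_offset / (drift_ppm * 1e-6)` seconds. The function name and both parameters are hypothetical.

```python
def max_resync_period(max_offset, drift_ppm):
    """Longest resynchronization period (s) that keeps the offset <= max_offset.

    Assumes the clock offset grows linearly, by drift_ppm microseconds
    for every second elapsed since the last resynchronization.
    """
    if max_offset <= 0 or drift_ppm <= 0:
        raise ValueError("max_offset and drift_ppm must be positive")
    return max_offset / (drift_ppm * 1e-6)
```

For instance, tolerating an offset of 10 microseconds between clocks that drift by up to 50 ppm would call for resynchronizing roughly every 0.2 s; both figures are illustrative and are not taken from the experiments.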



4 PERFORMANCE MEASUREMENTS 

The goal of the mechanism presented so far is to allow parallel computing
applications to achieve satisfactory performance while keeping the cost no
higher than strictly required. To characterize the performance, the average
end-to-end latency has been measured in a simple configuration, in order
to assess the impact of the mechanism. The cost of the mechanism is also
determined and compared to that of the standard ABR service.



4.1 Experiment configuration 

For moderate network sizes and buffer capacities, the most significant
contributions to latency come from the bottleneck links in the ATM network, due
to the cell loss and subsequent retransmissions that occur when these links
become congested. Thus, the configuration shown in Figure 6 is sufficient for
evaluating the performance of the proposed mechanism, and is simple enough to
keep simulations within a reasonable duration.




Figure 6 Simulated environment. 

All the links have a capacity of 155 Mb/s. Two types of sources are consid- 
ered: one data source modeling traffic from a real parallel computing applica- 
tion by means of a trace, and a number of background sources modeling traffic 
from traditional networking applications, by means of ON-OFF sources. The
traffic generated by the data source corresponds to the messages generated by
one task of the parallel computing application. In contrast, the traffic from 
each background source represents the result of multiplexing many sources of 
traffic from traditional networking applications. 

Traces for the data source have been obtained from the execution of parallel 
codes from the GENESIS benchmark suite [9]. In particular, the considered
codes have been PDE1 and PDE2. PDE1 is a solver of the Poisson equation on
a three-dimensional grid using red-black successive over-relaxation. PDE2
solves a two-dimensional Poisson equation using a multigrid method. The
traffic generated by PDE1 consists of relatively long bursts (around 8 KB).
In contrast, bursts from PDE2 are much shorter (50-100 bytes). As a result,
different behavior is expected for each code. 

As far as background traffic sources are concerned, the values for the
parameters of both the ON and OFF states are exponentially distributed. In
the measurements, several sets of values have been used in order to obtain
diverse aggregate input rates. In particular, the network utilization ρ ranges
from 0.3 to 1.1 with respect to the output link capacity, where ρ stands for
the average network load along the execution period. A value of ρ greater than
1 indicates that the aggregate incoming traffic is on average higher than the
output link capacity. As each background source models the result of
multiplexing several sources, we do not want a very aggressive background
traffic. Thus, the parameters of the ON-OFF models generate a traffic pattern
with a burstiness not higher than required to capture the characteristics of
multiplexed cell streams. As demonstrated in several papers, for example [10],
their burstiness decreases as the number of multiplexed sources grows.
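
To show how a set of ON-OFF parameters translates into a target network utilization, the following Python sketch computes the long-run rate of an ON-OFF source and the resulting load on the output link. It assumes that a source sends at a constant peak rate while ON and is silent while OFF; the function names and all numeric values used with them are hypothetical, since the parameter sets used in the measurements are not listed here.

```python
import random


def on_off_mean_rate(peak_rate, mean_on, mean_off):
    """Long-run average rate of a source that sends at peak_rate while ON."""
    return peak_rate * mean_on / (mean_on + mean_off)


def utilization(sources, link_capacity):
    """Network utilization for a list of (peak_rate, mean_on, mean_off) tuples."""
    return sum(on_off_mean_rate(*s) for s in sources) / link_capacity


def on_off_periods(mean_on, mean_off, n, seed=0):
    """Draw n (on, off) durations, both exponentially distributed."""
    rng = random.Random(seed)
    return [(rng.expovariate(1.0 / mean_on), rng.expovariate(1.0 / mean_off))
            for _ in range(n)]
```

As an example of use, three sources with a 155 Mb/s peak rate, a mean ON time of 2 ms and a mean OFF time of 8 ms would each contribute 31 Mb/s on average, giving a utilization of 0.6 on a 155 Mb/s output link.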

The switch is modeled as output-queued. Two priority levels are considered:
one for guaranteed service categories (in particular VBR), and the other for
best-effort service categories (ABR and UBR). The buffer space is shared by
the logical queues associated with each priority level. The buffering scheme is
basically drop-tail, except for the case of a full switch buffer, where the arrival
of a non-UBR cell forces the dropping of a UBR cell already queued in the
switch. The aggregate incoming traffic is arranged so that the switch
treats it as a mixture of UBR and ABR traffic. The ABR scheduling
algorithm adopted in the measurements is based on ERICA (Explicit Rate
Indication for Congestion Avoidance), fully described in [11]. Table 1 shows
the values for the most relevant parameters in the switch and the sources,
which in turn are mostly based on the defaults suggested in [4, 11, 12]. Table 2
displays the values for the parameters used in the performance evaluation
study presented in this section.
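
The buffering rule of the switch (drop-tail, plus push-out of a queued UBR cell when a non-UBR cell meets a full buffer) can be sketched as follows in Python. The sketch models only the shared buffer occupancy and ignores the two priority levels used for scheduling; the class `SharedBuffer` is hypothetical, and the choice of evicting the most recently queued UBR cell is an assumption, since the text does not say which UBR cell is dropped.

```python
class SharedBuffer:
    """Shared switch buffer: drop-tail with push-out of queued UBR cells."""

    def __init__(self, capacity):
        self.capacity = capacity   # buffer size, in cells
        self.cells = []            # service category of each queued cell
        self.dropped = 0           # total number of cells discarded

    def enqueue(self, category):
        """Offer a cell of the given category; return True if it is queued."""
        if len(self.cells) < self.capacity:
            self.cells.append(category)
            return True
        if category != "UBR":
            # Buffer full: look for a queued UBR cell to push out
            # (assumption: the most recently queued one is evicted).
            for i in range(len(self.cells) - 1, -1, -1):
                if self.cells[i] == "UBR":
                    del self.cells[i]
                    self.cells.append(category)
                    self.dropped += 1
                    return True
        self.dropped += 1          # plain drop-tail: the arriving cell is lost
        return False
```

With this policy a UBR cell arriving at a full buffer is always lost, whereas an ABR or VBR cell is lost only when no UBR cell remains in the buffer.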

Table 1 Values for the relevant parameters of the ABR service.

Element   Parameter              Value
Switch    Target Utilization     0.9
          Measurement Interval   30 cells
Source    Nrm                    32 cells
          ADTF                   0.5 sec
          Peak rate              50 Mb/s

Table 2 Parameters for our low-latency mechanism.

Parameter                    Value
SSCOP POLL interval          0.1 sec
LME monitoring interval      0.1 sec
LME loss threshold Tl        0.1 sec
ECE latency threshold Tm     0.0001 sec
ECE latency threshold T'm    0.00009 sec

4.2 Task-to-task latency 

Task-to-task latency is the measure determining the effective impact of com- 
munications on the performance of the parallel environment. As we assume 
that the ATM network is shared with other applications, we expect significant
variations in performance according to the load of the ATM network.
Figure 7 shows task-to-task latency as a function of the different values for the
background load. We have compared our proposal for enhancing the Parallel 
Computing AAL with the AAL without these enhancements, the latter by 
considering both UBR and ABR as the service categories conveying the data. 





(b) Parallel code: PDE2 



Figure 7 Latency measurements. 

According to Figure 7, our mechanism achieves performance equivalent to
that obtained by relying on an ABR service all the time. However, to assess
the actual advantages of our mechanism we have to consider other
factors, such as the effective utilization of the ABR service and the bandwidth
consumption. The relative performance of the measured approaches depends
on the particular characteristics of the communications in each application:
traffic from PDE1 is much more bursty than traffic from PDE2, as stated
earlier. Nevertheless, the next subsection shows that the ABR service
is used by only 30%-70% of the messages, depending on the application and
the network load. Therefore, in addition to equivalent performance, great
efficiency in resource exploitation may be achieved.





(a) Parallel code: PDE1 



(b) Parallel code: PDE2 



Figure 8 Cell loss ratio experienced by the application. 



In order to assess the relationship between the performance achieved by
the original UBR-only mechanism and by the ECE and the need for
retransmissions, we have measured the cell loss ratio experienced with these
configurations. The results in Figure 8 confirm that retransmissions are a
major cause of latency in ATM-based parallel computing environments, as shown
by the close relationship between the ‘UBR only’ curves in Figures 7 and 8,
and also that our ECE proposal succeeds in reducing the number of
required retransmissions, which is reflected in Figure 8 by a cell loss ratio
close to zero in the ‘LME/ECE’ case.

Due to the random component of the background traffic, several repetitions 
of the latency measurements have been performed. When considering a confi- 
dence level of 90%, the maximum radius for the confidence interval is 14% of 
the mean value in the worst case, which indicates a clear difference between 
UBR-only results and the rest. 
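
As an illustration of how such a relative confidence-interval radius can be computed, the sketch below uses a normal approximation, which is an assumption of this example; with only a few repetitions a Student-t quantile would be the more careful choice. The sample latencies and the function name are hypothetical and are not taken from the measurements reported here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def relative_ci_radius(samples, confidence=0.90):
    """Radius of the confidence interval for the mean, as a fraction of the mean.

    Uses a normal approximation (an assumption of this sketch).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two repetitions")
    # Two-sided quantile: 90% confidence leaves 5% in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    radius = z * stdev(samples) / sqrt(n)
    return radius / mean(samples)


# Hypothetical latencies (ms) from repeated runs at one background load.
latencies_ms = [10.2, 11.5, 9.8, 12.1, 10.9]
print(f"relative radius: {relative_ci_radius(latencies_ms):.1%}")
```

A radius that is small relative to the gap between two curves indicates that the difference between them is unlikely to be an artifact of the random background traffic.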



4.3 ABR service utilization 

We consider the fraction of the cells generated by a parallel task that used the
ABR service as a measure of the utilization of this service. As the ABR service
requires more resources from the network (a flow control mechanism, some
kind of priority, etc.) than the UBR service (which just takes advantage of
the bandwidth not consumed by the other service categories, so no particular
resources are allocated for it), the cost of information sent through ABR is
also higher.

(a) Parallel code: PDE1 (b) Parallel code: PDE2

Figure 9 ABR service utilization measurements.

Figure 9 displays the results of this measurement. PDE1 and PDE2 exhibit
different behavior, as expected given the different characteristics of their
communications. The following observations can be made:

• In PDE1, the proposed mechanism for the ECE achieves 40% utilization
for p = 0.7 and 70% for p = 1. These results show that our mechanism can
provide a highly cost-effective service. Thus, parallel applications whose
communications resemble those of PDE1 can obtain
a performance equivalent to that of the plain ABR service but with a higher
efficiency in resource usage.

• In PDE2, the ECE achieves a slightly higher utilization of the ABR service:
40% for p = 0.65, and 70% for p = 1. In this case, the service remains
cost-effective, although slightly less so than in PDE1. The utilization of the
ABR service is much less dependent on the application, as opposed to
PDE1. Thus, applications whose communication pattern is similar to that
of PDE2 can equally achieve cost-effective communications.

To assess the effective cost of our mechanism, we should take into
account the cost of the VBR service conveying the feedback information. As
illustrated below, its performance depends on the network load as well, so
we can lose some of the advantage in cost-effectiveness, especially in a highly
loaded network.



4.4 Bandwidth consumption 

As explained above, our mechanism achieves performance equivalent to
that obtained by using an ABR service exclusively, with a fairly low utilization
of the ABR service. However, these features do not come for free. We have seen
that feedback information uses a VBR service, whose cost is higher than that of the
ABR service.

Table 3 Average bandwidth consumption experienced by PDE1 (Kb/s).

p     Service   UBR only   ABR only   LME/ECE
0.7   UBR       278.8      -          154.0
      ABR       -          275.0      122.0
      Total     278.8      275.0      276.0
      VBR       -          -          5.0
1.0   UBR       308.9      -          88.9
      ABR       -          275.0      190.2
      Total     308.9      275.0      279.1
      VBR       -          -          5.1


Table 4 Average bandwidth consumption experienced by PDE2 (Kb/s).

p      Service   UBR only   ABR only   LME/ECE
0.65   UBR       285.9      -          216.0
       ABR       -          289.0      107.6
       Total     285.9      289.0      323.6
       VBR       -          -          7.1
1.0    UBR       338.9      -          103.6
       ABR       -          289.0      224.5
       Total     338.9      289.0      328.1
       VBR       -          -          6.0


Tables 3 and 4 show the bandwidth consumed in both PDE1 and PDE2,
considered as the total amount of bits transmitted during the execution period
by the services carrying the actual data, for two different network loads in each
case. The results show that, in both cases, the total consumed bandwidth is
slightly higher with our mechanism than with the use of ABR only, and the
difference is smaller in PDE1. Using UBR only, as expected, yields the highest
consumption due to the number of retransmitted cells, except for PDE2 when
p = 0.65, where the cell loss ratio is not high enough for the other mechanisms
to become advantageous. Another observation from Tables 3 and 4 is that the
fraction of bandwidth spent by the ABR service is closely related to the ABR
service utilization displayed in Figure 9.

Regarding the bandwidth spent by the VBR service, we recall that the
VBR service conveys the STAT frames, which are periodically generated upon
receipt of a POLL frame, as well as USTAT and LSTAT frames, which are
generated asynchronously. Thus, as expected, the spent bandwidth strongly
depends on the cell loss ratio, which in turn is related to p. In particular,
the higher the background load, the lower the consumed bandwidth, due to
the increased length of high-latency periods. Note that the
bandwidth consumed by the VBR service is small compared with that of the ABR
service: it is equivalent to 3%-7% of the ABR bandwidth consumption.
Thus, the total cost for the evaluated approach remains advantageous.
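
The ratio quoted above can be checked against the LME/ECE columns of Tables 3 and 4. The short sketch below only restates that arithmetic; the dictionary layout and the function name are ours and are not part of the evaluated system.

```python
# LME/ECE columns of Tables 3 and 4 (Kb/s): (application, load) -> (ABR, VBR)
lme_ece = {
    ("PDE1", 0.7): (122.0, 5.0),
    ("PDE1", 1.0): (190.2, 5.1),
    ("PDE2", 0.65): (107.6, 7.1),
    ("PDE2", 1.0): (224.5, 6.0),
}


def vbr_to_abr_percent(abr_kbps, vbr_kbps):
    """VBR bandwidth expressed as a percentage of the ABR bandwidth."""
    return 100.0 * vbr_kbps / abr_kbps


ratios = {key: vbr_to_abr_percent(*value) for key, value in lme_ece.items()}
for (app, load), pct in ratios.items():
    print(f"{app} load={load}: VBR is {pct:.1f}% of ABR")
```

All four ratios fall roughly within the 3%-7% range stated in the text.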



5 CONCLUSIONS 

In this paper, we have described and evaluated a mechanism to integrate com- 
munications generated by parallel computing applications in a private virtual 
network environment based on ATM. This mechanism has been designed to 
enhance the operation of a novel, specific AAL for parallel computing that 
was suggested in a previous work, which is based on a modified version of 
SSCOP. The mechanism presented in this work exploits the service categories 
provided by ATM. Typically, data applications use an ABR service to reduce 
the occurrence of cell loss, but the use of a UBR service when the network is 
unloaded can lead to similar performance. Thus, our mechanism for support- 
ing parallel computing applications uses UBR as the basic transfer service 
but, when latency experiences a significant increase, an ABR service is intro- 
duced. By means of this operation, we achieve low latency in communications 
and a cost-effective service. 

To evaluate the performance of our mechanism, we have undertaken a num- 
ber of simulation-based experiments. In particular, we have measured the 
end-to-end latency and cell loss ratio experienced by communications, the 
utilization of the ABR service category, and the bandwidth consumption, par- 
ticularly of the VBR service category. In view of the results yielded by these 
measurements, we observe that (1) the latency achieved by our mechanism 
is equivalent to the latency experienced when conveying all communication 
through ABR-based connections; (2) as in the worst case only 70% of cells use 
the ABR service category, the cost of communications with our mechanism 
is much lower than the cost inherent in the full use of ABR-based connections;
(3) the LME succeeds in determining the high-latency periods, since
our mechanism has been able to avoid most cell loss; and (4) the bandwidth
consumption is moderate and the requirements for the VBR service category
are sufficiently low, so the cost of communications is not significantly affected.
In summary, our mechanism allows parallel computing applications that
execute for a significantly long period to achieve cost-effective performance.

As introduced earlier, the mechanisms suggested in this work implement
only the data transfer part of ATM-based parallel computing environments.
Given that we want these platforms to extend beyond the local area, the
mechanisms to build and manage parallel computing environments should be
defined. In particular, these mechanisms should include a user interface
to facilitate the platform setup, as well as intelligent load balancing
algorithms so that optimal performance can be achieved at all times according
to the available resources. For longer-term research, we believe that
applications other than parallel computing may also benefit from similar mechanisms
and architectures, which can therefore be adapted in order to advance
the integration of services.



Abstract

With the growing popularity of the Internet and the increasing use of business and
multimedia applications, the users’ demand for higher and more predictable
quality of service has risen. A first step towards offering better than best-effort services
was the development of the integrated services architecture and the RSVP
protocol. But this approach proved suitable only for smaller IP networks and not
for Internet backbone networks. In order to solve this problem the concept of
differentiated services has been discussed in the IETF, which set up a working group in
1997. While RSVP classifies packets according to application flow properties,
differentiated services are based on the idea that the user negotiates a service profile
with his Internet service provider (ISP) for specially marked packets and then
transmits marked packets over the ISP network. A further significant difference from RSVP
is that, for scaling reasons, the service profile is only negotiated and
policed for a set of aggregated flows. This article gives an overview of the activities
of the IETF with regard to differentiated services and presents several proposals for
the implementation of differentiated services.



Keywords



1 INTRODUCTION

A central problem of today’s Internet lies in the mostly unpredictable service and
the often very low quality of transmission. At present there is no
satisfactory solution to this problem.

Figure 1 Monitoring of QoS using RTCP and RTP



A pragmatic approach to achieve good quality of service (QoS) is an adaptive 
design of the applications to react to changes of the network characteristics (e.g. 
congestion). Immediately after detecting a congestion situation the transmission rate 
may be reduced by increasing the compression ratio or by modifying the A/V coding 
algorithm. For this purpose functions to monitor quality of service are needed. For 
example, such functions are provided by the Real-Time Transport Protocol (RTP) 
(Schulzrinne et al 1996) and the Real-Time Control Protocol (RTCP). The receiver 
in Figure 1 measures the delay and the rate of the packets received. This information 
is transmitted to the sender via RTCP. With this information the sender can detect if 
there is congestion in the network and adjust the transmission rate accordingly. This 
may affect the coding of the audio or video data. If only a low data rate is achieved, a 
coding algorithm with lower quality has to be chosen. Without adaptation the packet 
loss would increase, making the transmission completely useless. 
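
The feedback loop described above can be sketched as follows. This is a minimal illustration, assuming the sender receives a loss fraction from each receiver report; the function name, the thresholds and the step sizes are all hypothetical choices for this example and are not prescribed by RTP or RTCP.

```python
def adapt_rate(current_kbps, loss_fraction, min_kbps=64.0, max_kbps=1024.0,
               loss_threshold=0.05, decrease=0.75, increase_kbps=16.0):
    """Return the next sending rate given the loss fraction from a receiver report.

    All thresholds and step sizes are illustrative assumptions.
    """
    if not 0.0 <= loss_fraction <= 1.0:
        raise ValueError("loss_fraction must be in [0, 1]")
    if loss_fraction > loss_threshold:
        # Congestion suspected: back off multiplicatively, e.g. by
        # switching to a coder with a higher compression ratio.
        new_rate = current_kbps * decrease
    else:
        # No congestion reported: probe for more bandwidth in small steps.
        new_rate = current_kbps + increase_kbps
    return max(min_kbps, min(max_kbps, new_rate))


rate = 512.0
for reported_loss in [0.0, 0.01, 0.12, 0.20, 0.02]:
    rate = adapt_rate(rate, reported_loss)
    print(f"loss={reported_loss:.2f} -> rate={rate:.0f} kb/s")
```

The lower bound in the sketch reflects the point made in the text: once the achievable rate falls below what the lowest-quality coding needs, adaptation alone no longer helps.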



1.1 Integrated Services and RSVP 

Adaptive methods have their limitations when an application requires a certain mini- 
mum bandwidth to achieve a reasonable QoS. In these cases a minimal QoS has to be 
guaranteed by resource reservation. Special applications with real-time requirements 
depend on resource and bandwidth reservation. This is the reason why the Integrated 
Services (IntServ) working group defined several services which extend the simple 
best-effort service: the Controlled Load Service and the Guaranteed Service. 

These services are provided for flows, i.e. application data streams between end
systems. For example, three flows exist in Figure 2: two from sender S to
receiver R1 and one from S to R2. Between the sender and R1 several applications
may be active (e.g. a data transmission via FTP and a terminal emulation), or an
application may support several flows at the same time (a WWW browser opens
several TCP connections to a server).

Figure 2 Application flows

The resources for a flow are reserved within the end systems and the routers using 
signaling protocols. For this reason, network elements like routers, nodes and even 
the operating systems within the end systems have to check whether sufficient CPU 
time, memory and network bandwidth are available in order to provide a certain 
service (admission control). Resources then have to be reserved and assigned to the 
packets of the respective flow (scheduling). Finally, compliance with the negotiated
traffic characteristics has to be monitored (policing).
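
The admission control and policing steps can be illustrated with a small sketch. It is deliberately simplified: it tracks bandwidth only (not CPU time or memory), and the class name, method names and flow labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Per-flow bandwidth bookkeeping for one link (illustrative only)."""
    capacity_kbps: float
    reserved: dict = field(default_factory=dict)  # flow id -> reserved rate

    def admit(self, flow_id, rate_kbps):
        """Admission control: accept the flow only if enough capacity remains."""
        if rate_kbps <= 0:
            raise ValueError("rate must be positive")
        if flow_id in self.reserved:
            raise ValueError("flow already has a reservation")
        if sum(self.reserved.values()) + rate_kbps > self.capacity_kbps:
            return False
        self.reserved[flow_id] = rate_kbps
        return True

    def conforms(self, flow_id, measured_kbps):
        """Policing: compare a measured rate with the flow's reservation."""
        return measured_kbps <= self.reserved.get(flow_id, 0.0)


link = Link(capacity_kbps=1000.0)
print(link.admit("S->R1/ftp", 600.0))       # accepted
print(link.admit("S->R1/terminal", 300.0))  # accepted
print(link.admit("S->R2", 200.0))           # rejected: only 100 kb/s left
```

Note that the router keeps one entry per flow, which is the state that becomes problematic in large networks.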

The Resource ReSerVation Protocol (RSVP) has been developed as a
signaling protocol for resource reservation (Braden et al 1997). RSVP extends the IP
protocol stack, i.e. data is transmitted unchanged using IP. It exchanges only
signaling information describing the QoS to be given to the TCP/IP or UDP/IP flows.
RSVP resource reservation is receiver-oriented: the receiver generates the
reservation message containing the desired service parameters for the received application
data flow.

RSVP has been criticized mainly for its limited applicability in large IP networks.
The RSVP working group of the IETF has evaluated the applicability of the current
version. The flow-based approach is considered the main problem of RSVP, since
resources are reserved for every single flow. This cannot be realized with
conventional routers if large networks with millions of users and possibly several flows per
user have to be supported. First, routers are not able to store such a huge number of flow
states because of limited memory resources. Secondly, the number of flows
increases the complexity of packet scheduling in the routers, and scheduling is essential for
guaranteed services. A further disadvantage is the lack of standards for accounting
and billing, which makes resource reservation quite unrealistic in practice. For these
reasons it is recommended to use RSVP only in small confined networks (Mankin et
al 1997).

Figure 3 DS byte in IPv4 and IPv6



2 DIFFERENTIATED SERVICES: BASICS AND TERMINOLOGY 

A demand for higher-level services apart from best-effort has been recognized, but 
these services cannot be realized using the integrated services approach, particularly 
in large IP networks. The differentiated services model tries to avoid the disadvan- 
tages of best-effort networks and the integrated services approach. 

The idea of differentiated services is based on the aggregation of flows, i.e. reser- 
vations have to be made for a set of related flows (e.g. for all flows between two 
subnets). Furthermore, these reservations are rather static since no dynamic reserva- 
tions for a single connection are possible. Therefore, one reservation may exist for 
several, possibly consecutive connections. 

IP packets are marked with different priorities by the user (either in an end system 
or at a router) or by the service provider. According to the different priority classes 
the routers reserve corresponding shares of resources, in particular bandwidth. This 
concept enables a service provider to offer different classes of QoS at different costs 
to his customers. 

The differentiated services approach allows customers to set a fixed rate or a
relative share of packets which have to be transmitted by the ISP with high priority.
The probability of providing the requested quality of service depends essentially on
the dimensions and configuration of the network and its links, i.e. whether individual
links or routers can be overloaded by high-priority data traffic. Though this concept
cannot guarantee any QoS parameters, as a rule it is more straightforward to
implement than continuous resource reservations and it offers a better QoS than mere
best-effort services.

For packet marking, the so-called DS byte (for differentiated services) in the header
of each IP packet is mapped to the IPv4 Type Of Service (TOS) octet or to the IPv6
Traffic Class octet (Figure 3). Six bits of this byte are used to define the per-hop
behavior (PHB) that a packet experiences in each router. The remaining two bits
correspond to the currently unused (CU) field, which is reserved for purposes not yet
specified and may be assigned later.
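
The split of the DS byte into a six-bit PHB field and a two-bit CU field can be sketched with a few bit operations. The text does not state which six bits carry the PHB; the sketch assumes the six high-order bits, which is the layout later standardized in RFC 2474. The function names are hypothetical.

```python
def split_ds_byte(ds_byte):
    """Split a DS byte into (phb, cu).

    Assumes the PHB field occupies the six high-order bits and the
    CU field the two low-order bits (an assumption of this sketch).
    """
    if not 0 <= ds_byte <= 0xFF:
        raise ValueError("DS byte must fit in 8 bits")
    return ds_byte >> 2, ds_byte & 0b11


def build_ds_byte(phb, cu=0):
    """Combine a 6-bit PHB value and a 2-bit CU value into one DS byte."""
    if not 0 <= phb < 64 or not 0 <= cu < 4:
        raise ValueError("phb must fit in 6 bits and cu in 2 bits")
    return (phb << 2) | cu


ds = build_ds_byte(0b101110)
print(f"DS byte: {ds:#04x}, fields: {split_ds_byte(ds)}")
```

A router at the border of two networks that use different PHB values for the same service would rewrite only the six PHB bits and leave the CU bits untouched.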

The meaning of the individual bits in the PHB field is not yet standardized and is
part of ongoing discussions in the Differentiated Services working group (DiffServ)
of the IETF. The proposal in (Baker et al. 1998) suggests using one bit for tagging
in-profile and out-of-profile packets and distinguishing service classes with different
priorities using the other five bits. Thus, a minimal backward compatibility with the TOS field
in IPv4 can be kept. (Nichols et al. May 1998) suggests standardizing two different
services, Default (DE) and Expedited Forwarding (EF), by using two code points in
the PHB field. Since the value of the PHB field for a certain service may differ
between ISP networks because of missing standards, it might be necessary
to change the value of the PHB field at the border of two networks.

It has to be pointed out that the size, meaning and name of the bit fields in the
DS byte are the subject of further discussion within the DiffServ working group and
might change again in the near future. Therefore, the explanations presented here
merely represent the status quo of the DiffServ working group. Several
sites on the WWW, which are referenced at the end of this text, contain up-to-date
information on the exact DS byte definition and should be consulted first.



3 SERVICES OF THE DIFFERENTIATED SERVICES APPROACH 

At present, several proposals exist for the realization of differentiated services. The 
approach allowing the combination of different services like Premium and Assured 
Service seems to be very promising. In both approaches absolute bandwidth is allo- 
cated for aggregated flows. They are based on packet tagging indicating the service 
to be provided for a packet. 

A similar idea is pursued by the Scalable Resource Reservation Protocol (SRP). 
Flows are aggregated automatically at each link, so that the network does not have 
to know every single flow. No particular signaling protocol is deployed. Only three 
different packet types (RESERVED, REQUEST, BEST-EFFORT) are introduced, 
which differ by the tag in the packet header. 

An alternative approach (user-share differentiation, USD) assigns bandwidth pro- 
portionally to aggregated flows in the routers (for example all flows from or to an IP 
address or a set of addresses). A similar service is provided by the Olympic service. 
Here, three priority levels are distinguished assigning different fractions of band- 
width to the three priority levels gold, silver and bronze, for example 60% for gold, 
30% for silver and 10% for bronze. 
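
As a small worked example of such proportional assignment, the sketch below divides a link among the three Olympic classes using the 60/30/10 split mentioned above. The link capacity of 1500 kb/s and the function name are assumptions made for illustration.

```python
def split_bandwidth(link_kbps, shares):
    """Divide a link's bandwidth among classes in proportion to their shares."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    return {name: link_kbps * share / total for name, share in shares.items()}


olympic = {"gold": 60, "silver": 30, "bronze": 10}
for level, kbps in split_bandwidth(1500.0, olympic).items():
    print(f"{level}: {kbps:.0f} kb/s")
```

Because only the ratios matter, the same shares can be applied unchanged to links of different capacity.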

In the following these services are described in more detail. 



3.1 Premium Service 

With Premium Service the user negotiates with his ISP a maximum bandwidth for
sending packets through the ISP network. Furthermore, the aggregated flow is
described by the packets’ source and destination addresses. In Figure 4 users and ISPs
have agreed on a rate of three packets/s for traffic from A to B. The user
configures the first-hop router in the individual subnet accordingly. In the example above a
packet rate of two packets/s is allowed in every first-hop router as it can be expected
that no two end systems will use the full bandwidth of two packets/s at the same
time.

Figure 4 Premium Service

First-hop routers have the task of classifying the packets received from the end
systems, i.e. of deciding whether the Premium Service shall be provided to the packets or not. If
so, the packets are tagged as Premium Service (P-bit) and the data stream is shaped
according to the maximum bandwidth. The user’s border router re-shapes the stream
(e.g. to three packets per second) and transmits the packets to the ISP’s border router,
which performs policing functions, i.e. it checks whether the user’s border router
remains below the negotiated bandwidth of three packets/s. If each of the two first-hop
routers allows two packets/s, one packet per second will be dropped by shaping or
policing at the border routers. All first-hop and border routers have two queues, one
for packets with the P-bit set and one for all others (see Figure 4). If the P-queue
contains packets, these are transmitted prior to the others. Implementing two queues
in every router of the network (ISP and user network) amounts to realizing a
virtual network for Premium Service traffic.
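
The two-queue behavior can be sketched as a strict-priority scheduler: the P-queue is always served first, and the other queue is served only when the P-queue is empty. The class and method names are hypothetical, and shaping and policing are left out of this sketch.

```python
from collections import deque


class TwoQueueRouter:
    """Strict-priority output port with a Premium queue and a queue for the rest."""

    def __init__(self):
        self.premium = deque()  # packets with the P-bit set
        self.other = deque()    # all other packets

    def enqueue(self, packet, p_bit):
        """Place a packet in the queue selected by its P-bit."""
        (self.premium if p_bit else self.other).append(packet)

    def dequeue(self):
        """Return the next packet to transmit, or None if both queues are empty."""
        if self.premium:
            return self.premium.popleft()
        if self.other:
            return self.other.popleft()
        return None


router = TwoQueueRouter()
for name, p_bit in [("be1", False), ("p1", True), ("be2", False), ("p2", True)]:
    router.enqueue(name, p_bit)
print([router.dequeue() for _ in range(4)])
```

Strict priority is safe here only because the Premium traffic has already been shaped to its negotiated rate; otherwise it could starve the other queue.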

Premium Service offers a service corresponding to a private leased line, with the 
advantage of making free network capacities available to other tasks, resulting in 
lower fees for the users.

Figure 5 Assured Service



3.2 Assured Service

A potential disadvantage of Premium Service is the weak support for bursts and the
fact that a user has to pay even if he is not using the whole bandwidth. The Assured
Service tries to offer a service which cannot guarantee bandwidth but provides a high
probability that the ISP transfers high-priority-tagged packets reliably. Concrete services
have not yet been defined, but an obvious choice is to offer services
corresponding to the controlled load service. The probability that packets are transported
reliably depends on the network capacity. An ISP may choose to keep the sum of all
bandwidths for Assured Service below the bandwidth of the weakest link. In
this case, only a small portion of the available capacity may be allocated in the ISP
network. An advantage of the Assured Service is that users do not have to establish
a reservation for a relatively long time. With ISDN or ATM, users might be unable to
use the reserved bandwidth because of the burstiness of their traffic, whereas Assured
Service allows the transmission of short-time bursts.

With the Assured Service the user negotiates a service profile with his service
provider, e.g. the maximum amount or rate of high-priority, i.e. Assured Service,
packets. The user may then tag his packets as high priority within the end system or
the first-hop router, i.e. tag them with an A-bit (see Figure 5). To avoid modifications
in the end systems, the first-hop router may analyze the packets with respect to their
IP addresses and UDP/TCP ports and then assign them the corresponding priority, i.e.
set the A-bit for conforming Assured Service packets. The maximum rate of
high-priority (A-bit) packets must not be exceeded. This is ensured by (re-)classification in
the first-hop routers and in the user’s border routers at the border to the ISP network.
Nevertheless, the service provider has to check whether the user remains below the
maximum rate for high-priority packets and apply corrective actions such as policing if
necessary.

Figure 6 Receiver-oriented realization of Assured Service

For example, the border router at the network entrance will tag a non-conforming
packet as low priority (out of service, out of profile). An alternative would be for
the ISP to charge higher fees for non-conforming packets. The tagging of low- and
high-priority packets is done by use of the DS byte.

Bursts are supported by making buffer capacity available for storing bursty traffic.
Inside the network, especially in backbone networks, bursts can be expected to be
compensated statistically.



(a) Receiver-oriented scenarios 

One problem of the Assured Service is the negotiation of the service profile between
the sender and the ISP. If an Internet user connects to a WWW server, the receiver
should be able to determine the transmission quality and take over the costs.
Therefore, the receiver should be able to set up a user profile with the ISP. At the border
between the ISP and the receiver’s network a border router knows the profile
agreement (see Figure 6). This router checks whether the received data flow conforms to
the service profile. If it does not, the ISP’s border router sets the forward congestion
notification (FCN) bit.

This bit might also be set by routers in the network to indicate a congestion
situation. If the packet conforms to the profile the border router resets the bit. If the
FCN bit is set, the receiver has to slow down the sender’s data flow, e.g. by delaying
TCP acknowledgments or by setting flow control information. If the receiver
does not react, the border router may drop future packets.



(b) Adaptation of applications 

The Assured Service can be combined with the concept of application adaptation.
An application can monitor the throughput or the loss rate via RTP/RTCP.
Accordingly, more or fewer packets might be tagged as high priority. If the
network is idle the application might transmit best-effort instead of high-priority packets
and save costs. On the other hand, the application has to increase the number of
high-priority packets if a high loss of low-priority packets is detected.

The maximum rate of high-priority packets has to be re-negotiated with the service
provider, requiring the support of dynamic reconfiguration or signaling.

Figure 7 First-hop router for Premium and Assured Service



3.3 Router implementation for Assured and Premium Service

The implementation of Assured and Premium Service requires several modifications
of the routers. Mainly classification, shaping, and policing functions have to be
performed by the router. These functions are necessary at the border between two
networks, for example at the transition from the customer network to the ISP or between
ISPs. Service profiles have to be negotiated between ISPs in the same way as at the
transition to the user.

(a) First-hop router 

Figure 7 shows the first-hop router function for Premium and Assured Service.
Received packets are classified, and accordingly the A- or P-bit is set if the packet
should be supported with Assured or Premium Service. As parameters for the
classification, source and destination addresses or information from higher-layer
protocols (e.g. port numbers) may be used. A pure best-effort packet is forwarded directly to the
so-called RIO-queue. The Assured Service packets also go to this queue. The Assured
Service packets are checked for conformance to the service profile. The A-bit
is only kept if the Assured Service bucket contains a token. Otherwise the A-bit
is cleared and the packet is handled as a best-effort packet. The RIO-queuing
shall guarantee that best-effort packets are dropped prior to Assured Service packets
if the capacity is exceeded.
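
The token check for Assured Service packets can be sketched with a simple token bucket. This is an illustrative sketch only: the refill rate, the bucket depth, the one-token-per-packet rule and all names are assumptions, since the text does not specify how the bucket is parameterized.

```python
class TokenBucket:
    """Token bucket that refills continuously at a fixed rate (tokens per second)."""

    def __init__(self, rate_per_s, depth):
        self.rate = rate_per_s
        self.depth = depth
        self.tokens = float(depth)  # start with a full bucket
        self.last = 0.0

    def take(self, now):
        """Refill for the elapsed time, then consume one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def mark_assured(bucket, arrival_times):
    """Return, for each arriving Assured Service packet, whether its A-bit is kept."""
    return [bucket.take(t) for t in arrival_times]


bucket = TokenBucket(rate_per_s=1.0, depth=2)
print(mark_assured(bucket, [0.0, 0.1, 0.2, 1.5]))
```

The bucket depth determines how large a burst keeps its A-bit; packets beyond the burst are not discarded at this point but merely lose their priority and continue as best-effort traffic.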

(b) Border router 

Similar to the first-hop router, an intermediate router has to perform shaping functions
in order to guarantee that no more than the allowed packet rate is transmitted to the
ISP. This is important since the ISP will check whether the user remains within the
negotiated service profile. The border router in Figure 8 will therefore drop
non-conforming Premium Service packets and reset the A-bit of non-conforming Assured Service
packets. Assured Service and best-effort packets share the same queue since both
types of packets may belong to the same source. A common queue avoids re-ordering
of packets, which is especially important for TCP performance.

Figure 8 Policing in a border router

(c) Queuing 

An important element in the implementation of Premium and especially Assured 
Service is a proper procedure for dropping packets in overload conditions. To dis- 
tribute the available bandwidth fairly among the flows in congestion situations, it is 
recommended to identify and to drop packets of aggressive data flows. 

The fundamental mechanism suggested for this purpose is the Random Early Detection
(RED) mechanism. RED is a new technique for router queue management and is
intended to eliminate the disadvantages of traditional queuing mechanisms.

With traditional queuing every supported queue accepts packets as long as
possible. If there is no space left in a queue, arriving packets are dropped, i.e. the packets
at the end of the queue are discarded. This method has two significant disadvantages:

• If bursts arrive at nearly full queues, the likelihood for packets of the burst to 
get lost is high. But queues are also intended for buffering packets in the case of 
bursts. Therefore, it is recommended to provide space for those bursts. 

• Full queues cause higher delays than queues with lower utilization. Especially for 
real time or interactive applications higher delays are not desired. 

RED (Braden et al. 1997) is a mechanism trying to keep the queue length below 
a certain limit in order to provide space for bursts. This is achieved by dropping 
packets even if the queue length is relatively small (see Figure 9). 

Below the lower threshold no packets are dropped. The more the queue length
exceeds the lower threshold, the higher the likelihood of dropping a received
packet. Packets are dropped at random so that the losses are not concentrated on a single
application data flow.

If the queue length reaches the upper threshold, all packets are dropped. With this 
mechanism the following advantages shall be achieved: 

• Bursts can be supported better since there is always a certain queue capacity re- 
served for incoming bursts. 

• By the lower average queue length the delays are reduced, providing better sup- 
port for real time applications. 
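
The threshold behavior of RED described above can be sketched as a drop-probability function. The text only says that the probability grows with the queue length between the two thresholds; the linear ramp up to a maximum probability `max_p` used here is an assumption of this sketch, as are the parameter values. Practical RED implementations also work on an averaged rather than an instantaneous queue length, which is omitted here.

```python
import random


def red_drop_probability(queue_len, min_th, max_th, max_p=0.1):
    """Drop probability for an arriving packet under a simplified RED.

    Below min_th nothing is dropped, at or above max_th everything is
    dropped, and in between the probability rises linearly up to max_p.
    """
    if not 0 <= min_th < max_th:
        raise ValueError("thresholds must satisfy 0 <= min_th < max_th")
    if queue_len < min_th:
        return 0.0
    if queue_len >= max_th:
        return 1.0
    return max_p * (queue_len - min_th) / (max_th - min_th)


def red_accepts(queue_len, min_th, max_th, max_p=0.1, rng=random.random):
    """Randomly decide whether an arriving packet is accepted into the queue."""
    return rng() >= red_drop_probability(queue_len, min_th, max_th, max_p)


for q in (5, 20, 30):
    print(q, red_drop_probability(q, min_th=10, max_th=30))
```

Because each arriving packet is dropped independently with the same probability, flows that send more packets lose more packets, which is what spreads the losses across flows.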

RED is especially capable of dividing the available bandwidth fairly among TCP
data flows, as packet loss automatically leads to a reduction of a TCP data flow’s
packet rate. The situation is more problematic for data that does not conform to TCP
behavior, for example real-time applications based on UDP or multicast applications
without an adaptation or flow control mechanism. Such flows have to be treated
specially to prevent them from overloading the network.

The queuing algorithm RIO (RED with In and Out) (Clark et al. 1997) has been 
suggested for Assured Service implementation. RIO is an extension of the RED 
mechanism. A common queue is provided for in-profile and out-of-profile packets, 
but different dropping procedures (droppers) are applied. The dropper for 
out-of-profile packets (out-dropper) discards packets earlier, i.e. at a substantially 
lower queue length, than the dropper for in-profile packets (in-dropper), i.e. for 
packets with the A bit set. Moreover, the dropping probability of the out-dropper increases more rapidly 
than the probability of the in-dropper (see Figure 9). This tries to keep the dropping 
probability of in-profile packets low. 
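
A minimal sketch of the two droppers sharing one queue is given below. The
threshold and probability values are invented for illustration, and the linear
ramp is the usual RED shape rather than something specified in the text. For
simplicity both droppers look at the same queue length here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dropper:
    min_th: int     # queue length at which dropping starts
    max_th: int     # queue length at which every packet is dropped
    max_p: float    # probability reached just below max_th

    def probability(self, queue_len):
        if queue_len < self.min_th:
            return 0.0
        if queue_len >= self.max_th:
            return 1.0
        return self.max_p * (queue_len - self.min_th) / (self.max_th - self.min_th)


# Hypothetical settings: the out-dropper starts earlier and rises more steeply.
IN_DROPPER = Dropper(min_th=40, max_th=70, max_p=0.02)
OUT_DROPPER = Dropper(min_th=10, max_th=30, max_p=0.10)


def rio_drop_probability(queue_len, in_profile):
    """Pick the dropper according to the packet's profile marking (A bit)."""
    dropper = IN_DROPPER if in_profile else OUT_DROPPER
    return dropper.probability(queue_len)
```

With these settings an out-of-profile packet already faces drops at a queue
length of 20, while in-profile packets are untouched until the queue reaches 40.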

For the implementation of different service types routers have to support several 
queues, e.g. a queue for Assured or Premium Service. Special bits, e.g. in the TOS 
field of IPv4 or in the traffic class field of IPv6, indicate which service shall 
be provided to the packet. 



3.4 User-Share Differentiation 

Based upon packet tagging Premium and Assured Service models can fulfill the stip- 
ulated service parameters like bit rates with a high degree of probability only if the 
ISP network is dimensioned appropriately and non best-effort traffic is transmitted 
between certain known networks only. 

If for instance two users have contracted a bit rate of 1 Mbps for Assured Service 
packets with an ISP and both wish to receive data simultaneously at a rate of 1 Mbps 
each from a WWW server which is connected to the network with a 1.5 Mbps link, 
the requested quality of service cannot be provided. 

Figure 9 The queuing algorithm RIO 

The User-Share Differentiation approach (Wang 1997) avoids this problem by 
contracting not absolute bandwidth parameters but relative bandwidth shares. A user 
will be guaranteed only a certain relative amount of the available bandwidth in an 
ISP network. In practice, the size of this share will be in direct relation to the charged 
costs. 

In Figure 10, user A has been allocated only half of the bandwidth of user B and one 
third of the bandwidth of user C. If, for example, A and B access the network at the 
same time over a low bandwidth link with a capacity of 30 kbps, user B will receive 
a bandwidth of 20 kbps but user A will get merely 10 kbps. If B and C access the same 
or possibly a different network via a common high bandwidth link with a capacity of 
25 Mbps, B will receive 10 Mbps and C 15 Mbps. 
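
The arithmetic behind this example can be reproduced with a few lines of code.
The sketch assumes that the link capacity is divided among the currently active
users in proportion to their contracted shares (1 : 2 : 3 for A, B and C); the
function and variable names are illustrative only.

```python
def usd_allocation(link_capacity, shares, active_users):
    """Split link_capacity among active users in proportion to their shares."""
    total = sum(shares[user] for user in active_users)
    return {user: link_capacity * shares[user] / total for user in active_users}


# Relative shares taken from the example: A has half of B and a third of C.
shares = {"A": 1, "B": 2, "C": 3}

low_speed = usd_allocation(30, shares, ["A", "B"])    # 30 kbps link
high_speed = usd_allocation(25, shares, ["B", "C"])   # 25 Mbps link
```

On the 30 kbps link this yields 10 kbps for A and 20 kbps for B; on the 25 Mbps
link it yields 10 Mbps for B and 15 Mbps for C.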

Simpler router configuration is an important advantage of the USD approach. 
However, absolute bandwidth guarantees cannot be supported. 



3.5 Olympic Service 

The Olympic Service (Nichols et al. February 1998) specifies an appropriate service 
to be deployed within an ISP or a domain. Deployment of this service requires the 
implementation of a rate-based link share scheduler behavior at each hop. Three 
service levels are distinguished: gold, silver and bronze. In case of a congested link 
packets with "Olympic gold" service will get a larger share of the link than packets 
sent using the "Olympic silver" service, which in turn get a larger share than packets 
with "Olympic bronze" service. When there are no packet flows of gold or silver, 
packets with "Olympic bronze" service may utilize the entire output link. 

Figure 10 User Share Differentiation (USD) 

Flows are classified at a boundary by marking their packets for a link share. The 
exact method of service discrimination is not specified but should be selected in a 
way that it makes a perceptible difference to customers. A possible configuration of 
the link sharing could be to allocate 60% for gold, 30% for silver and 10% for bronze, 
although different configurations could be thought of. Customers do not specify a 
particular traffic profile for the Olympic Service, nor is there any admission 
control, shaping or policing of flows. 



3.6 Scalable Reservation Protocol 

The Scalable Resource Reservation Protocol (SRP) developed at the Institute for 
Computer Communications and Applications (ICA) of ETH Lausanne represents 
yet another proposal in addition to Assured and Premium Service for a possible im- 
plementation of differentiated services in the Internet (Almesberger et al. 1997). As 
indicated by its name much effort has been spent on making the protocol well scal- 
able even for large numbers of packet flows. End systems (i.e. sender and receiver) 
play an active part in resource reservation while additional control of the sender’s 
behavior is done at the affected routers. Each router aggregates all incoming data 
flows and monitors this aggregated data stream in order to estimate the necessary 
resources (now and in future) at that node. 

The so-called estimators play an important role in the process of resource 
reservation. It is their job to estimate the amount of resources needed for reservation. 
Estimators are deployed in the sender, the receiver and the routers in between. At 
the sender the estimator helps to make an (optimistic) prediction of the reservation 
of network resources required for the data to be transported. The estimator of the 
receiver computes a (conservative) estimate of the resources actually reserved by the 
network and periodically sends this information back to the sender. 

Without requiring explicit signaling of flow parameters the reservation mechanism 
consists of a reservation protocol and a feedback protocol which will be discussed in 
the following. 

(a) The Reservation Protocol 

The reservation protocol is deployed from sender to receiver requiring that sender, 
receiver and all routers in between have implemented this protocol. Three different 
packet types are distinguished by a tag to be defined in the packet headers. 

REQUEST Packets marked as REQUEST belong to flows wishing to reserve 
network resources. If a router forwards such a packet it agrees to accept packets 
tagged as RESERVED in the future at the same transmission rate. Thus, an 
implicit reservation at the router takes place. 

RESERVED If there exists already a reservation at the router and if packets marked 
as RESERVED arrive at a rate agreed-upon in an earlier stage, the router has to 
forward them and must not discard them. 

BEST-EFFORT No reservations exist in the nodes for these packets, and the pack- 
ets may be deleted by the routers in case of congestion. This service corresponds 
to today’s best-effort service of the Internet. 

A sender wishing to make a reservation begins by transmitting data packets that 
it marks as REQUEST packets and that already contain the application data. On 
arriving at a router they are inspected by admission control functions. 
These functions monitor the arriving aggregated flow of packets tagged as RESERVED 
and estimate the amount of local resources needed to maintain a "good" quality of service. 
These resources consist of the available bandwidth, the buffers’ sizes and further 
local resources of the router. When the router receives a packet tagged as REQUEST 
for forwarding it has to decide whether the QoS will deteriorate by adding the packet 
to the existing RESERVED-flow. If this is not the case, the packet, which continues 
to be marked as REQUEST, can be forwarded, and the estimator of the router has to 
be updated accordingly. 

If the necessary additional resources are not available, the packet is degraded to 
best-effort service by appropriate tagging before being forwarded. In particular, no 
reservations are performed at the router. Packets marked as BEST-EFFORT or 
REQUEST may be deleted by a router in case of congestion. An end-to-end reservation 
is only achieved if packets arriving at the receiver are still marked as REQUEST, i.e. 
resources are allocated at each router on the transport path. By degrading packets 
marked as RESERVED in case of insufficient resources at a router, a sender cannot 
get a better QoS by sending only RESERVED-packets. 
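
The per-packet decisions of a router can be sketched as follows. This is a
deliberately simplified model and not the estimator of the SRP drafts: it assumes
that the router counts bytes per measurement interval, that a fixed byte budget
per interval is reservable, and that the estimate for the next interval is simply
what was carried in the current one. All class, method and attribute names are
hypothetical.

```python
from enum import Enum


class Tag(Enum):
    REQUEST = "REQUEST"
    RESERVED = "RESERVED"
    BEST_EFFORT = "BEST-EFFORT"


class SrpRouter:
    def __init__(self, reservable_bytes_per_interval):
        self.capacity = reservable_bytes_per_interval
        self.accepted = 0        # estimate: RESERVED bytes the router agreed to carry
        self.reserved_seen = 0   # RESERVED bytes forwarded in this interval
        self.request_seen = 0    # REQUEST bytes forwarded in this interval

    def process(self, tag, size):
        """Return the tag with which the packet is forwarded."""
        if tag is Tag.REQUEST:
            # Admission control: forward unchanged only if resources suffice.
            if self.accepted + self.request_seen + size <= self.capacity:
                self.request_seen += size
                return Tag.REQUEST
            return Tag.BEST_EFFORT
        if tag is Tag.RESERVED:
            # RESERVED traffic beyond the agreed amount is degraded.
            if self.reserved_seen + size <= self.accepted:
                self.reserved_seen += size
                return Tag.RESERVED
            return Tag.BEST_EFFORT
        return Tag.BEST_EFFORT

    def end_interval(self):
        """Update the estimate; unused reservations disappear without signaling."""
        self.accepted = self.reserved_seen + self.request_seen
        self.reserved_seen = 0
        self.request_seen = 0
```

In this model a sender that marks everything as RESERVED without having sent
REQUEST packets first gains nothing, since its packets exceed the agreed amount
and are forwarded as best-effort.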

Reserved resources need not be released explicitly by the sender. Some time after 
the end of the flow, the estimators in the routers will observe an over-allocation 
of resources and will adjust the estimated share of reserved resources in the 
routers. 

(b) The Feedback Protocol 

Periodically, the receiver sends back feedback information to the sender containing 
the arriving rates of REQUEST- and RESERVED-packets measured at the receiver. 
To this end a special feedback protocol needs to be implemented, e.g. RTP/RTCP 
(Schulzrinne et al 1996), in order to notify the sender about the current transmission 
quality. On receiving this feedback information the sender may begin to send packets 
tagged as RESERVED while observing a transmission rate based on the received 
feedback from the receiver. If the sender wishes more resources to be allocated for 
its flow it can keep on sending packets tagged as REQUEST. 

SRP has been tested using simulations, although some topics need further inves- 
tigation. Policing at network borders and multicasting are not covered in working 
drafts currently available and are subject of on-going research. The use of SRP for 
Virtual Private Networks (VPN) is not advised since at each router individual packet 
flows are aggregated to one large flow which is then treated uniformly. For packet 
tagging the PHB-field in the DS byte could be used. The necessary code points will 
be applied for at a future meeting of the DiffServ working group. 



4 COEXISTENCE OF DIFFERENTIATED AND INTEGRATED 
SERVICES 

Integrated and differentiated services do not necessarily have to be considered as 
competing concepts. It is rather advisable to combine both approaches. While differ- 
entiated services are recommended for rather large IP networks, the approach chosen 
for integrated services can be appropriate for limited-size networks, e.g. corporate 
networks or virtual private networks (VPN). 

Both services will be integrated if e.g. VPNs extend over a large IP backbone 
network. Such a VPN might consist of a client subnet and a server subnet intercon- 
nected by a large ISP network, possibly by use of a tunnel. By means of differentiated 
services techniques both subnets can be linked allocating bandwidth for the traffic 
between the two subnets. 

In such a case it is necessary to map integrated services parameters to 
differentiated services parameters, similar to the mapping of integrated services 
parameters to ATM parameters or IEEE 802.1p priorities. In the past such mappings 
have been defined by the Integrated Services Over Specific Link Layers (issll) 
working group. In the same way a transformation from integrated services to 
differentiated services has to be made. In this respect Ford et al. (1998) suggest 
mapping guaranteed service to Premium Service and controlled load to Assured Service. 

A general framework for the integration of differentiated services and integrated 
services is proposed in (Bernet et al. 1998). The order of events for making an 
RSVP-reservation in the scenario illustrated in Figure 4 is as follows. First, a 
sender (server) generates PATH messages. In the server network these messages are 
processed according to the RSVP protocol by the border router Z2 and other 
RSVP-routers lying between Z2 and the sender. In the example a reservation for 
differentiated services has been made between Z2 and Z1, e.g. a Premium Service with 
a bandwidth of 1 Mbps. In the network between Z2 and Z1 RSVP-messages are forwarded 
transparently by routers that do not know RSVP. Only the router Z1 processes the 
PATH-message again. The message arrives at one of the receivers (clients) who can then 
make an RSVP-reservation using a RESV-message. 

This message is processed again by Z1 and Z2. Z2 has to check whether the 
requested reservation (e.g. 600 kbps) is covered by the differentiated services 
reservation. This is for instance the case if no RSVP-reservation over the 
differentiated services network has been made yet. If there exists already an 
RSVP-reservation of 500 kbps between the two subnets, the new reservation of 
600 kbps cannot be realized and will therefore be rejected by Z2. Finally, the 
RESV-message reaches the corresponding sender of the PATH-packet. 
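
The check performed by Z2 amounts to simple bookkeeping against the capacity of
the differentiated services reservation. The sketch below uses the figures of the
example (1 Mbps Premium Service between Z2 and Z1); the class and method names are
assumptions made for illustration.

```python
class BorderRouter:
    """Tracks RSVP reservations mapped onto one differentiated services reservation."""

    def __init__(self, diffserv_capacity_kbps):
        self.capacity_kbps = diffserv_capacity_kbps
        self.admitted_kbps = 0

    def handle_resv(self, requested_kbps):
        """Admit the RESV request if it still fits, otherwise reject it."""
        if self.admitted_kbps + requested_kbps <= self.capacity_kbps:
            self.admitted_kbps += requested_kbps
            return True
        return False


# No earlier RSVP-reservation: 600 kbps fits into the 1 Mbps Premium Service.
z2_empty = BorderRouter(1000)
first_request_ok = z2_empty.handle_resv(600)

# An RSVP-reservation of 500 kbps already exists: 600 kbps more is rejected.
z2_loaded = BorderRouter(1000)
z2_loaded.handle_resv(500)
second_request_ok = z2_loaded.handle_resv(600)
```

A rejected request leaves the admitted total unchanged, so a later, smaller
request (up to 500 kbps in the second case) could still be accepted.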

When the sender begins to send the real data, Z2 has to do the appropriate mapping 
on a Differentiated Service. For instance, the DS byte in a packet has to be set to the 
correct PHB value corresponding to Premium Service if a guaranteed service was 
requested. Z1 will then reset the DS byte again. 



5 SUMMARY AND OUTLOOK 

At first the differentiated services model seems to be a highly promising concept to 
provide qualitatively improved services for the Internet since it avoids the obvious 
drawbacks of the integrated services architecture. In general, however, guaranteed 
services for application flows are not possible with this approach. It is 
questionable whether customers will be satisfied with these kinds of services. It 
therefore seems worthwhile to integrate the two concepts of integrated and 
differentiated services. An important factor for the success of differentiated 
services will be whether an IP network can be dimensioned in such a manner 
that the available bandwidth on all links is sufficient to forward all differentiated 
services packets. This presents a very demanding challenge to network planners. 

The tasks in the IETF working groups concerning the standardization consist of 
defining the precise syntax of the DS byte. Moreover, a definition of the management 
information bases (MIBs) is needed to create a common basis for the configuration 
of differentiated services parameters in a router. Finally, all the queuing algorithms 
based on various differentiated services have to be defined in order to implement 
these services in heterogeneous router environments. 

Up-to-date information on the development in the DiffServ working group and 
related Internet drafts are available on the official homepage of the working group 
in the WWW. The mailing list dealing with many aspects of differentiated services 
and the corresponding mail archive offer a close view at on-going discussions and 
decisions made by the working group in recent time. The URLs to the mentioned 
resources on the Internet can be found at the end of this text. 

Further investigation has to be done on the support of dynamically changing ser- 
vice requirements. Usually, a customer has to negotiate a service contract with the 
ISP before making use of a service, e.g. by phone, fax, email or WWW form. The 
agreed-upon parameters then have to be used by the network operators to configure 
the routers accordingly. Approaches based on active networking could be used for 
this task, e.g. allowing the customer to run configuration scripts on the routers of 
the ISP. A different approach would be to use a signaling protocol for the requested 
service parameters, possibly an adapted version of RSVP. 

So far, the deployment of differentiated services for multicast services has been 
hardly investigated. SRP is one of the few approaches where researchers are consid- 
ering multicasting explicitly. The difficulties essentially lie in the fact that the total 
need of bandwidth for an IP multicast flow does not only depend on the transmission 
rate but also on the size of the multicast group and how the individual group members 
are spread. The latter two criteria however are very difficult to determine in advance. 
These parameters may dynamically change because of the receiver-oriented IP mul- 
ticast concept. 

For obvious reasons differentiated services could be implemented using networks 
with QoS capabilities (e.g. ATM). This of course requires a suitable mapping of 
differentiated services to ATM services. Especially in the area of ATM different con- 
cepts of IP switching are going to establish themselves. However, IP switching tries 
to bypass IP routers and to forward the packets using switching as often as possible. 
This in turn may lead to switched packets bypassing shaping and policing functions 
in the routers, which is inconsistent with the differentiated services architecture. For 
these scenarios it has to be ensured that either all packets always pass routers with 
shaping and policing functions or that these functions are realized at so-called ingress 
and egress routers. 



Abstract 

The IETF Mobile IPv6 protocol provides a mobility management scheme for the 
Internet. Mobile IPv6 handles macro-mobility and micro-mobility identically. We 
believe that a hierarchical scheme that separates micro-mobility from macro- 
mobility is preferable since it would be more scalable. In this paper, we present a 
mobility management architecture that makes use of the IPv6 Address format 
hierarchy to provide an efficient and scalable architecture to manage mobility in 
the Internet. The proposed scheme, which is fully compatible with the IETF 
solution, differentiates the inter-site mobility management from the intra-site 
mobility management. The hosts’ local mobility is handled with a local, possibly 
customized, protocol while the global mobility, i.e. across sites, is handled with 
Mobile IPv6. Our approach has two main advantages. First, the mobility of a host 
within a site is fully transparent to its correspondent nodes. As a result, the 
mobility management signaling load is minimized and some of Mobile IPv6’s 
security issues are solved. We show that the signaling load generated by our 
proposal is at least 69% lower than that of Mobile IPv6. Second, by 
differentiating intra-site mobility from inter-site mobility, we provide an 
architecture that is hierarchical, scalable, flexible and customizable; each site can 
deploy the intra-site mobility management scheme that is the most appropriate to 
its particular needs. 

1. INTRODUCTION 



Mobile Internet users require special support to maintain connectivity as they 
change their point-of-attachment. This support should provide performance 
transparency to mobile users and should be scalable. Providing performance 
transparency means that higher level protocols should be unaffected by addition of 
mobility support. Issues that may affect performance transparency are optimum 
routing of packets to and from mobile nodes and efficient network transition 
procedures (Myles, 93). The mobility support should be scalable in the sense that it 
should keep providing good performance to mobile users and should keep the 
network load low as the network grows and as the number of mobile nodes 
increases. This scalability issue is a very important one in the context of a still 
growing worldwide network such as the Internet. 

The IETF Mobile IPv6 standard, which provides a mobility management scheme 
for the Internet, does not completely meet these design goals. While it provides 
performance transparency, we argue that Mobile IPv6 is not scalable. In Mobile 
IPv6, a mobile node sends a location update to each of its correspondent nodes 
periodically and any time it changes its point-of-attachment. The resulting 
signaling and processing load can become very significant as the number of mobile 
nodes increases. This limitation is the result of the lack of hierarchy in the 
mobility management procedures of Mobile IPv6. In fact, Mobile IPv6 handles 
macro-mobility and micro-mobility identically. Since 69% of a user’s mobility is 
local, we believe that a hierarchical scheme that separates micro-mobility from 
macro-mobility is preferable. 

In this paper, we present a mobility management architecture that makes use of the 
IPv6 Address format hierarchy to provide an efficient and scalable solution. The 
proposed scheme, which is fully compatible with the IETF solution, differentiates 
the inter-site mobility management from the intra-site mobility management. The 
hosts’ local mobility is handled with a local, possibly customized, protocol while 
the global mobility, i.e. across sites, is handled with Mobile IPv6. Our approach 
has two main advantages over Mobile IPv6. First, the mobility of a host within a 
site is fully transparent to its correspondent nodes. As a result, the mobility 
management signaling load is minimized and some of Mobile IPv6’s security issues 
are solved. We show that the signaling load generated by our proposal is at least 
69% lower than the Mobile IPv6 one. Second, by differentiating intra-site mobility 
from inter-site mobility, we provide an architecture that is hierarchical, scalable, 
flexible and customizable. Our proposal is efficient; it provides optimum routing 
from and to mobile hosts and improves handoff latency. It is flexible and 
customizable; each site can deploy the intra-site mobility management scheme that 
is the most appropriate to its needs. 

This paper is structured as follows. In the next section, we introduce some 
terminology that is used throughout this paper. Section 3 presents the related work 
including the IETF Mobile IPv6 and its hierarchical derived proposals. Section 4 
details our mobility management proposal. Section 5 evaluates and compares the 
performance of our scheme with the IETF one. Section 7 concludes our paper. 



2. TERMINOLOGY 

The following terms are used in this paper to identify the principal network entities 
that are of interest to our proposal. A mobile host, MH, is a node that may move 
through the Internet. A correspondent host, CH, is a host communicating with the 
MH. The network and the site of a MH when it is not travelling are respectively 
called the home network and the home site of the MH. The network and the site 
that a MH may visit are respectively referred to as the foreign network and the foreign 
site of the MH. A MH’s Care-of Address is the global IP address the MH acquires 
when visiting a foreign network. This address is topologically correct on the 
foreign network. A MH’s Site Care-of Address is the first Care-of Address that a 
MH acquires when visiting a site. When MH is in its home network, it is accessible 
through its Home Address. 



3. RELATED WORK 



In this Section, we present some of the mobility schemes proposed for the Internet. 
We start with IETF Mobile IPv6 and then describe two of its hierarchical derived 
approaches, which have been proposed in the context of Mobile IPv4. 

The Mobile IPv6 protocol is currently being specified by the IETF IP Routing for 
Wireless/Mobile working group (Perkins, 96b). With Mobile IPv6, each time the 
mobile node moves from one subnet to another, it gets a new care-of address. It 
then registers its Binding (association between a mobile node’s home address and 
its care-of address) with a router in its home subnet, requesting this router to act as 
the home agent for the mobile node. This router registers this binding in its Binding 
Cache. At this point, the router serves as a proxy for the mobile node until the 
mobile node’s binding entry expires. The router intercepts any packets addressed to 
the mobile node’s home address and tunnels them to the mobile’s care-of address 
using IPv6 encapsulation. The mobile node also sends a Binding Update to its 
correspondent nodes, which can then send packets directly to the mobile node. 
While this protocol optimizes the routing of packets to mobile hosts, it is not 
scalable. As the number of mobile nodes increases in the Internet, the number of 
Binding messages will also increase proportionally and add a significant extra 
load to the network. 

Caceres et al. have proposed a hierarchical mobility scheme based on Mobile 
IPv4 that separates three cases: local mobility, mobility within an administrative 
domain and global mobility, in order to reduce the generated signaling load 
(Caceres,96). This proposal has been made in the context of Mobile IPv4, which 
uses foreign agents, i.e. agents that mobile hosts connect to when they visit a foreign 
network. (Caceres,96) defines a hierarchy of foreign agents. In this proposal, each 
subnet that a mobile node could visit has one or more subnet foreign agents, which 
manage local mobility. On top of those subnet foreign agents, a domain foreign 
agent manages mobility across the different subnets of an administrative domain. 
The mobile node’s home agent only keeps track of the movement of the mobile 
node across administrative domain boundaries. As a result, the mobile node’s 
motion within an administrative domain is transparent to the home agent and its 
correspondent nodes. The hierarchical architecture of this scheme is very 
interesting but strongly relies on the deployment of foreign agents, which makes it 
incompatible with the Mobile IPv6 protocol. 

In the scheme proposed by Balakrishnan et al. (Balakrishnan,95), packets destined 
for a mobile node are delivered to the mobile node’s home agent using the IETF 
Mobile IPv4 and are then multicast to multiple base stations in close vicinity of the 
mobile node. While this approach is hierarchical, we believe that this solution is 
neither very efficient nor scalable. In fact, packets destined for a mobile node have 
to transit through the home agent, which can be distant from the mobile node’s current 
location. This has the effect of increasing packet delivery latency, handoff latency 
and the Internet load. 

4. A HIERARCHICAL MOBILITY MANAGEMENT SCHEME 



Mobile IPv6 handles local mobility of a host (i.e. within a site or a network) the 
same way as it handles global mobility (inter-site or inter-network mobility). In 
fact, in Mobile IP, a mobile user sends location updates to its home agent and its 
correspondent nodes each time it changes its point-of-attachment regardless of the 
locality and amplitude of its movement. As a consequence, the same level of 
signaling load is introduced in the Internet independently of the user’s mobility 
pattern. 

We argue that this approach is not scalable and that a hierarchical solution is more 
appropriate to the Internet. We believe that a user’s mobility within a site or a 
network should be managed locally and transparently to its correspondent nodes. 
Using such a hierarchical approach has at least two advantages. First, it improves 
handoff performance, since local handoffs are performed locally. This increases the 
handoff speed and minimizes the loss of packets that can occur during the 
transition phase. Second, it significantly reduces the mobility management 
signaling load on the Internet since the signaling corresponding to local moves 
does not cross the whole Internet but is confined to the site or the network. This 
hierarchy is furthermore motivated by the significant geographic locality in user 
mobility patterns. As shown in (Kirk,95), most of a user’s mobility is local. 
According to this study, 69% of a user’s mobility is within its home site (within its 
building and campus) and 70% of all professionals can be classified as mobile. It is 
therefore important to design a hierarchical mobility management architecture that 
optimizes local mobility. 

We propose a hierarchical architecture that separates local mobility (within a site) 
from global mobility. Our architecture is hierarchical in two points. First, it 
separates the local mobility management from the global one. Local handoffs are 
managed locally and transparently to the mobile host’s correspondent hosts. Second, it 
clearly separates the protocols managing local mobility from the protocols 
managing global mobility. In fact, while the hierarchy in the mobility management 
operations could be performed by the same protocol, we propose to use two 
different protocols. As illustrated by Figure 1, we define the concepts of MISP 
(Mobility Internal Site Protocol), that manages mobility within a site, and of MESP 
(Mobility External Site Protocol), that manages mobility between sites. The 
concept of site is quite general. We use the definition of site as it is given in 
(Hinden,98). A site is a set of networks belonging to the same administrative 
entity, such as a company or an access provider. Any two hosts of a site must be 
able to exchange packets without the support of the Internet backbone. A site is 
connected to the rest of the Internet via one or several interconnection routers. The 
approach that we propose provides more flexibility to the sites, which can deploy the 
MISP most appropriate to their needs. A large site can, for example, use a 
hierarchical mobility management protocol, and add an extra level of hierarchy to 
the global architecture. 



Figure 1 MISPs and MESP 

4.1 Inter-site Mobility Management Issues 

The Inter-site Mobility Management protocol manages the mobility of hosts 
between sites. This protocol has to be global to the whole Internet. We propose to 
use the Mobile IPv6 protocol since it is the current IETF solution and we believe 
that it is an efficient solution to manage macro-mobility. As with the “regular” 
Mobile IPv6, a mobile host requires the service of a home agent in its home 
network. This HA intercepts packets addressed to the MH and forwards them 
toward the MH’s current Care-of Address. 

When a mobile host gets into a new site, it obtains a Care-of Address and 
communicates it to its home agent and possibly to its correspondent nodes by 
sending a Binding Update composed of its Home Address and its Care-of 
Address. Thereafter, and as long as the mobile host stays within this site, this 
Care-of Address, which we call the Site Care-of Address, is used in all Binding 
Updates sent to the Home Agent and Correspondent Hosts. Note that this Site Care-of 
Address is used in the Binding Updates even if the mobile host moves within the 
site and gets new Care-of Addresses. 
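
The rule for choosing the address advertised outside the site can be illustrated
with a small sketch. The class, its attributes and the addresses (taken from the
2001:db8::/32 documentation prefix) are hypothetical; the sketch only shows which
binding would be placed in a Binding Update sent to the Home Agent and to
Correspondent Hosts outside the site.

```python
class MobileHost:
    def __init__(self, home_address):
        self.home_address = home_address
        self.site = None        # site currently visited
        self.site_coa = None    # first Care-of Address obtained in that site
        self.coa = None         # current (local) Care-of Address

    def attach(self, site, coa):
        """Record a new point of attachment and return the external binding."""
        if site != self.site:
            # Entering a new site: this address becomes the Site Care-of Address.
            self.site = site
            self.site_coa = coa
        self.coa = coa
        # Outside the site only the Site Care-of Address is ever visible.
        return (self.home_address, self.site_coa)


mh = MobileHost("2001:db8:100::1")
entering = mh.attach("site-X", "2001:db8:a:1::1")
moving_inside = mh.attach("site-X", "2001:db8:a:2::1")
leaving = mh.attach("site-Y", "2001:db8:b:1::1")
```

The binding returned for the move inside site-X is identical to the one returned
on entry, which is why such a move stays invisible to hosts outside the site.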

Upon reception of a binding, the HA and CH update their binding list and use the 
Site Care-of Address specified in the BU to communicate with the MH, in 
conformance with the Mobile IP protocol specification (Perkins, 96b). 

4.2 Intra-site Mobility Management Issues 

The Intra-site Mobility Management protocol manages the mobility of hosts within 
a site. As opposed to the MESP, the MISP can differ from site to site and can 
be customized to each site’s needs. For example, a large site may deploy a 
hierarchical MISP while a smaller one may use Mobile IPv6. In the next section, 
we describe a Mobile IPv6-based MISP. This solution results in a two-level Mobile 
IPv6 protocol: one level manages macro-mobility (MESP) and the other one 
manages micro-mobility (MISP). 

4.2.1 Mobile IP-based MISP 

When a mobile host moves within a site and changes its point-of-attachment, it gets 
a new Care-of Address, CoA_2, and sends a Binding Update, composed of CoA_2 and 
its Site Care-of Address, CoA_s, to all of the interconnection routers¹ of the site, 
and a Binding Update, composed of its Home Address and its current Care-of 
Address, CoA_2, to its local correspondent nodes. Each interconnection router of the 
site maintains a Binding list with one entry (CoA, CoA_s) for each mobile host 
currently roaming in the site. 

¹ All the interconnection routers of a site could be made accessible via the use of a well-known IPv6 
multicast address or via a multicast address communicated to the MHs by the Neighbor Discovery 
protocol. 

When a packet addressed to a mobile host arrives at one of the site’s
interconnection routers, this router searches its Binding list for an entry whose
Site Care-of Address field matches the destination address of the incoming packet².
(1) If no entry is found, the packet is routed normally within the site. (2) If an
entry is found, the router tunnels the packet to the current (local) Care-of Address
of the mobile host, as specified in Mobile IPv6.
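
The lookup performed by an interconnection router can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name InterconnectionRouter, its methods, and the ("route" / "tunnel") return convention are assumptions made for the example, and the Binding list is modelled as a plain dictionary keyed by Site Care-of Address.

```python
from typing import Dict, Tuple


class InterconnectionRouter:
    """Hypothetical model of a site interconnection router in the Mobile IP-based MISP."""

    def __init__(self) -> None:
        # Binding list: Site Care-of Address -> current (local) Care-of Address.
        self.bindings: Dict[str, str] = {}

    def receive_binding_update(self, site_coa: str, local_coa: str) -> None:
        # One entry per mobile host currently roaming in the site.
        self.bindings[site_coa] = local_coa

    def forward(self, destination: str) -> Tuple[str, str]:
        """Return the action taken and the address the packet is sent towards."""
        local_coa = self.bindings.get(destination)
        if local_coa is None:
            # (1) No entry: the packet is routed normally within the site.
            return ("route", destination)
        # (2) Entry found: tunnel to the mobile host's current Care-of Address.
        return ("tunnel", local_coa)
```

A dictionary gives constant-time lookups on average; a real router would more likely use a longest-prefix or radix-tree structure, which this sketch does not attempt to model.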

When a host sends packets to a mobile host that is located within its site, it first
uses the mobile host’s home address and then switches to the mobile host’s Care-of
Address as soon as it receives a Binding Update. As a result, if the site is the home
site of the mobile host, the first packet is intercepted by the mobile host’s home
agent and tunneled to its current site address. If the site is not the home site of the
mobile host, the first packet is intercepted by one of the interconnection routers of
the site and then tunneled to the mobile host’s site address.

4.2.2 Other MISPs

Other MISPs could be deployed. For example, the Sony VIP (Teraoka,92) and the
Columbia MHP (Bhawat,95), which were designed for small networks, could be
good candidates for MISPs. The PIM-based mobility management scheme
presented in (Castelluccia,98) could be used for larger sites. A GSE-based
approach (O’Dell, 98) could also be considered. In this approach, the site
interconnection routers would dynamically replace the destination address of a
packet addressed to a mobile host’s home address or Care-of Address with the
current Site Address. While this solution avoids encapsulating packets and
makes better use of local resources, it introduces some security and
identification problems.

4.2.3 MISP Compatibility Issues

An important consideration in our architecture is MISP compatibility. For
extensibility reasons, a mobile host must be able to use the different
MISPs without having to understand all of them. Therefore, the operations
performed by the mobile hosts in the different MISPs have to be standardized.

We propose that the mobile host operations be limited to the emission of Binding
Updates to one or several special addresses. These addresses, which can change
from MISP to MISP, could be communicated to the mobile host through the IPv6
Neighbor Discovery mechanism. For example, in the Mobile IP-based MISP this
address is the multicast address of the site interconnection routers.

² Maintaining these per-mobile host entries is not necessarily a scalability limitation since data
structures exist that allow routers to handle long lists of entries efficiently (Sklower,93).

4.3 The IPv6 Mobility Bit

The site interconnection routers play a central role in our architecture. In fact, they
filter all incoming packets to demultiplex packets addressed to mobile hosts from
those addressed to fixed hosts. This operation can be expensive if the routers must
compare the destination address of all incoming packets against the list of mobile
hosts roaming within the site. To minimize this cost, we propose the definition of a
mobility bit within IPv6 addresses to help routers efficiently demultiplex
packets addressed to mobile hosts from those addressed to fixed hosts.

The IPng Working Group has defined a global address format for IPv6, the
Aggregatable Global Address Format (Hinden,98). This address, which is presented
in Figure 2, is structured into a three-level hierarchy: (1) Public topology (48 bits),
(2) Site topology (16 bits) and (3) Interface Identifier (64 bits). The Public
topology is the collection of providers and exchange points. The Site topology is
local to a specific site or organization. It is used by an individual organization to
create its own local addressing hierarchy and to identify subnets. Interface
identifiers identify interfaces on links.




Figure 2 IPv6 Aggregatable Global Unicast Address Format

Figure 3 Modified IPv6 Unicast Address



For performance reasons, we propose to define a Mobility bit within the Site
topology field of the IPv6 address format (see Figure 3). This bit, which is only
meaningful within a site, is used by the site interconnection routers to efficiently
demultiplex packets addressed to a mobile host from packets addressed to a fixed
host. The mobility bit is set to 1 in mobile hosts’ addresses and set
to 0 in fixed hosts’ addresses. By examining this bit, the site interconnection
routers can immediately determine whether an incoming packet should be routed
internally by the standard routing protocol or by the local MISP. As a result, packets
addressed to fixed hosts do not suffer from the routers’ MISP processing. The
mobility bit is not a requirement; it is just a suggestion to speed up packet
processing at routers. Note that the mobility bit does not need to be deployed in
every site and does not affect the routing of packets on the backbone, since it is
only meaningful and used within a site.
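
The test a router would perform on the mobility bit can be sketched with Python's standard ipaddress module. The field widths (48-bit Public topology, 16-bit Site topology, 64-bit Interface Identifier) come from the text; the exact position of the mobility bit inside the Site topology field is not given there, so the choice of the most significant bit (MOBILITY_BIT) is purely an assumption for illustration, as are the function names.

```python
import ipaddress

INTERFACE_ID_BITS = 64      # low-order 64 bits: Interface Identifier
SITE_FIELD_MASK = 0xFFFF    # next 16 bits: Site topology
MOBILITY_BIT = 0x8000       # ASSUMPTION: top bit of the Site topology field


def site_topology(address: str) -> int:
    """Extract the 16-bit Site topology field of an IPv6 address."""
    value = int(ipaddress.IPv6Address(address))
    return (value >> INTERFACE_ID_BITS) & SITE_FIELD_MASK


def is_mobile(address: str) -> bool:
    """True if the (assumed) mobility bit is set to 1 in the address."""
    return bool(site_topology(address) & MOBILITY_BIT)


def select_processing(address: str) -> str:
    """Decide which path an incoming packet takes at an interconnection router."""
    # Fixed-host packets skip the Binding list lookup entirely.
    return "MISP" if is_mobile(address) else "standard routing"
```

With this layout, a single mask-and-test replaces a search of the Binding list for every packet addressed to a fixed host.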
5. COMPARISON AND EVALUATION



In this section we compare the performance of our proposal and of Mobile IPv6. 
When comparing the performance of different mobility management schemes, 
several factors have to be taken into consideration. Among these factors, three are 
particularly important (Myles, 93): (1) The scalability property of the schemes, i.e. 
how do the schemes behave as the network grows and the number of mobile hosts 
increases. (2) The routing performance of the schemes, i.e. what is the extra 
latency introduced by each of the schemes. (3) The transition performance of the 
schemes, i.e. how fast are the transition phases performed. 

5.1 Routing and Transition Performance 

The routing and transition performances of both schemes are quite similar. The
routing is optimal (packets follow the shortest path from the correspondent nodes
to the mobile host) and handoffs are performed locally in both proposals. In
our proposal, local handoffs are managed within the site. In Mobile IPv6, while
location updates have to cross the whole Internet to reach the mobile host’s
correspondent nodes, a mechanism is provided to smooth out transitions: after
switching to a new default router, a mobile node may send a Binding Update to its
previous default router, asking it to redirect all incoming packets to its new
Care-of Address.

5.2 Scalability Performance

The main performance difference between the compared approaches resides in 
their scalability property. The scalability property of a protocol can be evaluated in 
terms of its overhead growth on the Internet with the size of the Internet, the 
number of mobile hosts and the number of correspondent nodes. This overhead 
can be evaluated by comparing, for each proposal, their memory requirements and 
their signaling load, i.e. the bandwidth used by the control messages, such as the 
Binding Updates. 

5.2.1 Memory Requirement

We evaluate, in this Section, the memory requirement of each proposal. 

Mobile IPv6 requires that (1) each mobile node maintains a list of its
correspondent nodes and (2) each correspondent node maintains a binding per
mobile host it is communicating with. The corresponding memory requirement,
Mem_MIP, can therefore be evaluated as follows:

$$Mem_{MIP} = 2 \times \#MH \times \#CH \times Size_{entry} \qquad (1)$$

where #MH is the average number of mobile hosts on the Internet, #CH is the
average number of correspondent hosts of each mobile host, and Size_entry is the size
of the binding that a CH must maintain per MH, and an MH must maintain per CH.

Our hierarchical proposal requires that (1) each interconnection router of a site
maintains a binding per mobile host currently visiting the site, (2) each mobile node
maintains a list of its correspondent nodes and (3) each correspondent node
maintains a binding per mobile host it is communicating with that is not roaming
within its home site. The corresponding memory requirement of our approach,
Mem_HMIP, can therefore be evaluated as follows:

$$Mem_{HMIP} = [(1 + y) \times \#CH + \#Routers] \times \#MH \times Size_{entry} \qquad (2)$$

where #Routers is the average number of interconnection routers of a site in the
Internet and y is the fraction of mobility that is non-local. According to [Kirb95],
y = 0.31; therefore,

$$Mem_{HMIP} = [1.31 \times \#CH + \#Routers] \times \#MH \times Size_{entry} \qquad (3)$$

The gain (or loss) of our approach over the Mobile IP approach is defined as:

$$G_{Mem,AV} = (Mem_{MIP} - Mem_{HMIP}) / Mem_{MIP} \qquad (4)$$

or, substituting equations (1) and (3),

$$G_{Mem,AV} = 0.345 - \#Routers / (2 \times \#CH) \qquad (5)$$

This gain is displayed in Figure 4 as a function of (#CH/#Routers).



[Figure 4: G_Mem,AV (vertical axis) as a function of #CH/#Routers (horizontal axis)]



These results show that our approach’s memory requirement is lower than the 
Mobile IP one if #CH is larger than a threshold, T, equal to 1.45x #Routers. 
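
Equations (1) to (5) can be checked numerically with a few lines of Python. The function and parameter names below are chosen for this illustration only; the default y = 0.31 is the value quoted in the text, and the break-even point follows from setting the gain of equation (5) to zero.

```python
def mem_mip(n_mh: float, n_ch: float, size_entry: float) -> float:
    """Equation (1): each binding is held once by the MH and once by the CH."""
    return 2 * n_mh * n_ch * size_entry


def mem_hmip(n_mh: float, n_ch: float, n_routers: float,
             size_entry: float, y: float = 0.31) -> float:
    """Equation (2): MH lists, non-local CH bindings, and router bindings."""
    return ((1 + y) * n_ch + n_routers) * n_mh * size_entry


def memory_gain(n_ch: float, n_routers: float, y: float = 0.31) -> float:
    """Equations (4)-(5) in closed form: (1 - y)/2 - #Routers / (2 x #CH)."""
    return (1 - y) / 2 - n_routers / (2 * n_ch)


def break_even_ch(n_routers: float, y: float = 0.31) -> float:
    """Number of correspondent hosts above which the gain becomes positive."""
    return n_routers / (1 - y)
```

For example, with 10 correspondent hosts per mobile host and 2 interconnection routers per site, the closed form gives a gain of 0.345 - 2/20 = 0.245, and break_even_ch(1) is about 1.45, matching the threshold stated above.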

5.2.2 Signaling Load Overhead on the Internet (Backbone)

Both mobility management proposals use Binding Updates to set up state in the
routers and/or the end-hosts. This signaling has a cost in terms of bandwidth
utilization on the Internet. In this section, we compare the signaling load
introduced by both proposals on the Internet (backbone). We evaluate, for each of
these schemes, the aggregated signaling bandwidth consumed on the Internet.
This aggregated bandwidth does not depend on the number of nodes that the Binding
Updates have to cross to reach their destinations; rather, it corresponds to the
signaling bandwidth on one link. To simplify, we do not consider the
Acknowledgement messages that can sometimes be sent in response to Binding
Updates. We do not compare the local signaling loads because they are comparable
for both schemes and because we argue that local resources are not the most critical.
In this evaluation, we differentiate three types of mobility: the local mobility of a
host within its home site, the local mobility of a host within a foreign site, and the
inter-site mobility of a host. We then evaluate the average signaling load over these
three mobility patterns.

Binding Update Emission Frequency 

The signaling load of a scheme depends directly on the Binding Update Emission
Frequency. According to (Perkins, 96b), a Binding Update is sent periodically to refresh the
corresponding cache entries, and any time a mobile host changes its point of attachment. The
emission frequency of a Binding Update, freq, is therefore dependent on the mobility pattern
of a host, freq_MOV, and the refresh frequency, freq_REF. It is defined as follows:

$$freq = a \times freq_{MOV} \quad \text{if } freq_{REF} > freq_{MOV} \qquad (6)$$

$$freq = a' \times freq_{REF} \quad \text{if } freq_{MOV} > freq_{REF} \qquad (7)$$

with

$$a = \lceil freq_{REF} / freq_{MOV} \rceil, \qquad a' = \lceil freq_{MOV} / freq_{REF} \rceil$$
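
Equations (6) and (7) translate directly into code. The sketch below assumes that the bracket symbols in the definitions of a and a' denote the ceiling function; the function name is chosen for this example, and the case where both frequencies are equal is not covered by the equations, so the value returned for it is an assumption.

```python
import math


def bu_frequency(freq_mov: float, freq_ref: float) -> float:
    """Binding Update emission frequency according to equations (6) and (7).

    Integer (or otherwise exactly representable) inputs are recommended,
    because math.ceil is sensitive to floating-point rounding of the ratio.
    """
    if freq_ref > freq_mov:
        a = math.ceil(freq_ref / freq_mov)        # a  = ceil(freq_REF / freq_MOV)
        return a * freq_mov                       # equation (6)
    if freq_mov > freq_ref:
        a_prime = math.ceil(freq_mov / freq_ref)  # a' = ceil(freq_MOV / freq_REF)
        return a_prime * freq_ref                 # equation (7)
    # ASSUMPTION: when both frequencies coincide, one Binding Update per
    # period serves as both refresh and move notification.
    return freq_mov
```

For instance, with freq_MOV = 2 and freq_REF = 5 (in the same unit), a = 3 and the emission frequency is 6, which is higher than either individual frequency because the two are not multiples of one another.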



Local Mobility within the Home Site 

When a mobile host, using Mobile IPv6, is moving within its home site, it sends a
Binding Update to each of its correspondent nodes and to its home agent at a frequency
of freq. If our hierarchical proposal is used, no Binding Update has to be sent over the
Internet at all.

As a result, the signaling bandwidths generated by Mobile IP,
BW_SIG_MIP,home(t), and by our proposal, BW_SIG_HMIP,home(t), when an MH is roaming within
its home site, are defined as follows:

$$BW_{SIG\_MIP,home}(t) = Size_{BU} \times freq \times \#CH \qquad (8)$$

$$BW_{SIG\_HMIP,home}(t) = 0 \qquad (9)$$

where Size_BU³ is the size of a Binding Update and #CH is the number of correspondent
hosts that are not in the home site.

Local Mobility within a Foreign Site 

When a mobile host, using Mobile IPv6, is moving within a foreign site, it sends a Binding
Update to each of its correspondent nodes and to its home agent at a frequency equal to freq.
If our proposal is used, the mobile host only sends a Binding Update to each of its
correspondent nodes and to its home agent at a frequency equal to the refresh frequency,
freq_REF⁴. As a result, BW_SIG_MIP,foreign and BW_SIG_HMIP,foreign are defined as follows:

$$BW_{SIG\_MIP,foreign} = Size_{BU} \times freq \times (\#CH + 1) \qquad (10)$$

$$BW_{SIG\_HMIP,foreign} = Size_{BU} \times freq_{REF} \times (\#CH + 1) \qquad (11)$$



Inter-Site Mobility 

The signaling bandwidth introduced on the Internet when a mobile node is moving
from one site to another is the same in both schemes. In each of these schemes, the
mobile host sends a Binding Update to its home agent, to its distant correspondent hosts
and to the correspondent hosts that were in its previous site. Therefore, BW_SIG,transit
is defined as follows:

$$BW_{SIG,transit} = Size_{BU} \times (\#CH + \#ch + 1) \qquad (12)$$

where #ch is the number of correspondent hosts that are located in the previous site.
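
The bandwidth expressions (8) to (12) can be written as small Python functions. This is an illustrative sketch: the function names are invented for the example, frequencies are taken in Binding Updates per second so that results are in bytes per second, and the default Binding Update size of 68 bytes is the value given in the paper's footnote (40-byte IPv6 header plus 28-byte extension header).

```python
SIZE_BU = 68  # bytes: IPv6 header (40) + Binding Update extension header (28)


def bw_mip_home(freq: float, n_ch: int, size_bu: int = SIZE_BU) -> float:
    """Equation (8): Mobile IPv6, host moving within its home site."""
    return size_bu * freq * n_ch


def bw_hmip_home() -> float:
    """Equation (9): the hierarchical proposal sends nothing over the Internet."""
    return 0.0


def bw_mip_foreign(freq: float, n_ch: int, size_bu: int = SIZE_BU) -> float:
    """Equation (10): Mobile IPv6, host moving within a foreign site."""
    return size_bu * freq * (n_ch + 1)


def bw_hmip_foreign(freq_ref: float, n_ch: int, size_bu: int = SIZE_BU) -> float:
    """Equation (11): only periodic refreshes leave the foreign site."""
    return size_bu * freq_ref * (n_ch + 1)


def bw_transit(n_ch: int, n_ch_prev: int, size_bu: int = SIZE_BU) -> float:
    """Equation (12): bytes sent for one inter-site move, identical in both schemes."""
    return size_bu * (n_ch + n_ch_prev + 1)
```

Note that equation (12) counts the bytes generated by a single transition, whereas equations (8) to (11) are rates; the sketch keeps that distinction rather than converting one into the other.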

Analysis of the Results 

In this section, we evaluate, for each of the mobility patterns, the gain G achieved by
our proposal over Mobile IPv6. We denote by G_home the gain when the host is moving
within its home site, by G_foreign the gain when the host is moving within a foreign site,
and by G_transit the gain when the host is moving from one site to another. G_y (with
y = home or foreign) and G_transit are defined as follows:

$$G_y = (BW_{SIG\_MIP,y} - BW_{SIG\_HMIP,y}) / BW_{SIG\_MIP,y} \qquad (13)$$

$$G_{transit} = (BW_{SIG\_MIP,transit} - BW_{SIG\_HMIP,transit}) / BW_{SIG\_MIP,transit} \qquad (14)$$

We also evaluate the average gain, G_AV, over the three mobility patterns, by using
the result established in (Kirk, 95) that 69% of a host’s mobility is local.
G_AV is defined as follows:

$$G_{AV} = 0.69 \times G_{home} + 0.31 \times (\alpha \times G_{foreign} + \beta \times G_{transit}) \qquad (15)$$

where α + β = 1, α = (N-1)/N and β = 1/N, N being the average number of
different points of attachment of a mobile host within a site.

³ The size of a Binding Update is equal to the size of an IPv6 header (40 bytes) plus the size of a
Binding Update Extension Header (28 bytes), i.e. 68 bytes. A Binding Update can however be smaller
if it is sent with some payload.

⁴ The Binding Updates sent to the local correspondent hosts do not cross the Internet.

α and β characterize the mobility pattern of a user outside of its home site;
α defines the proportion of intra-site versus inter-site moves of the mobile host. A large
β means that the user is frequently changing sites. A large α means that the user is
mainly roaming within a site and rarely changes sites. For example, an α of 0.9
means that the mobile host changes its point of attachment, on average, 10 times
within a site before moving to another site.

α and β can be written as functions of T and freq_MOV:

$$\beta = 1 / (T \times freq_{MOV}) \qquad (16)$$

with α = 1 - β.

The achieved gains computed with the previous results are presented in the 
following table. 

Table 1 Signaling Load Gains

Fig. 5 Average Signaling Load Gain (G_AV as a function of freq_MOV/freq_REF)

These results show that our proposal’s average signaling load on the Internet is at 
least 69 % lower than the signaling load generated by Mobile IP. 

According to equations (6) and (7), f = (freq - freq_REF)/freq is defined as follows:

$$f = (a' - 1) / a' \quad \text{if } freq_{MOV} > freq_{REF} \qquad (17)$$

$$f = (a - freq_{REF} / freq_{MOV}) / a \quad \text{if } freq_{REF} > freq_{MOV} \qquad (18)$$
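
The average gain of equation (15) can be evaluated numerically once the per-pattern gains are known. The sketch below is a reading of the equations rather than the authors' computation: it assumes G_home = 1 (equations (8) and (9)), G_transit = 0 (equation (12) is the same for both schemes), and G_foreign = f = (freq - freq_REF)/freq (equations (10) and (11)). Function names are chosen for the example.

```python
def foreign_gain(freq: float, freq_ref: float) -> float:
    """f = (freq - freq_REF) / freq, the gain within a foreign site."""
    return (freq - freq_ref) / freq


def average_gain(g_foreign: float, alpha: float,
                 g_home: float = 1.0, g_transit: float = 0.0,
                 local_share: float = 0.69) -> float:
    """Equation (15), with beta = 1 - alpha and 69% of mobility being local."""
    beta = 1.0 - alpha
    return local_share * g_home + (1.0 - local_share) * (
        alpha * g_foreign + beta * g_transit
    )
```

Under these assumptions the average gain reduces to 0.69 + 0.31 x α x f, so it can never fall below 0.69; with freq = 6, freq_REF = 5 and α = 0.9 it is 0.69 + 0.31 x 0.9 / 6 = 0.7365.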

Figure 5 shows the average gain G_AV as a function of freq_MOV/freq_REF for α equal to
1/2 and 1. This figure shows that the gain of our approach over Mobile IP is always
at least 69% and gets larger when a mobile host has a high mobility frequency,
freq_MOV, and is mainly roaming within a site (α is close to 1.0). The peaks that
appear when freq_MOV is smaller than freq_REF exhibit a synchronization problem of
Mobile IPv6. In fact, when freq_MOV is smaller than freq_REF and freq_REF is not a
multiple of freq_MOV, two consecutive Binding Updates are sent: one to refresh the
cache entries of the correspondent hosts, followed by the Binding Update sent when
the mobile host changes its point of attachment. This problem does not exist with our
approach since the second Binding Update is not sent on the Internet but is confined
inside the site.



6. DISCUSSIONS AND CONCLUSIONS



This paper presents a mobility management architecture for the Internet that is 
hierarchical and flexible. The proposed scheme, which is fully compatible with the 
IETF solution, differentiates the inter-site mobility management from the intra-site 
mobility management. The hosts’ local mobility is handled by a local, possibly 
customized, protocol while the global mobility, i.e. across sites, is handled by 
Mobile IPv6. 

When a mobile host is roaming within its home site, its mobility is fully hidden
from its external correspondent nodes, which see the mobile host as a regular fixed
host. This property could be achieved with "regular" Mobile IPv6 by using the
triangular routing mode (i.e. the mobile host does not send any Binding Update to
its correspondent nodes). However, in this case, and as opposed to our proposal, all
packets addressed to a mobile host are first delivered to its Home Agent and then
forwarded to the mobile host’s current Care-of Address. These indirections can
drastically increase the latency and the network load if the home site of
the mobile host is large and the mobile host is far away from its home agent. Our
proposal uses several home agents per mobile host: one, located on the mobile
host’s home subnet as in the regular Mobile IPv6 protocol, handles local
communications (between the mobile host and correspondent hosts of the site);
the others, located on the site interconnection routers, handle external
communications (between the mobile host and correspondent hosts outside the
mobile host’s home site).

When a mobile host is roaming within a foreign site, its local mobility (i.e. within
this site) is hidden from its correspondent hosts. These hosts are aware that the
mobile host is visiting the foreign site but are unaware of its local moves. This
level of hierarchy is not provided by the current IETF Mobile IPv6 proposal, which
requires that a correspondent host be aware of all of a mobile host’s moves. Note that
Mobile IPv4, which is the protocol that manages mobility in IPv4, defines the
concept of Foreign Agents. A foreign agent is an agent that a mobile host may register
with when it is visiting a foreign network. Packets addressed to the mobile host
are then delivered to its foreign agent and forwarded to the mobile host. Foreign
agents were originally defined to limit the impact of mobility on the small
IPv4 address space, but (Caceres,96) shows that they can also be very useful in
defining a hierarchical mobility management scheme. However, foreign agents
have not been retained in Mobile IPv6, since IPv6 provides a much larger address
space than IPv4. We argue that foreign agents should be reconsidered and adapted
to Mobile IPv6 to define a hierarchical scheme. This paper proposes to deploy
Mobile IPv6 foreign agents in the site interconnection routers to hide mobile
hosts’ local moves from their correspondent nodes. These agents provide functions
that are very similar to those of home agents.

As shown in this paper, using a hierarchical mobility management scheme reduces
the mobility management signaling load. In fact, we show that the signaling load
generated by our proposal is at least 69% lower than that of Mobile IPv6.
When a mobile host is roaming within its home site, no Binding Update has to be
sent over the Internet. Besides saving Internet resources, eliminating this
signaling has several other advantages. First, it reduces the risk of attacks on and
tracking of a mobile host. It also eliminates the need for authentication and
encryption of the Binding Updates and the associated difficult issue of key
distribution over the Internet. Second, eliminating the signaling makes it possible to
provide mobility management support to sites that are connected to the Internet with a
unidirectional and/or asymmetrical link, such as a satellite link. Third,
eliminating the signaling load is important for scalability reasons. Mobile IPv6
requires that each host maintains one entry per mobile host it is communicating
with. This requirement can be overwhelming for big servers, such as Web servers,
that must maintain one entry for each of their mobile clients. By handling locally the
moves of mobile hosts within their home site, we reduce the number of mobile
hosts visible on the Internet and consequently the number of entries that each
correspondent host must maintain.

Abstract

Active networks are a new area of research and as such there are many different 
ideas as to what an active network should be. One possible architecture is 
described in this paper along with experimental results that characterize the 
performance and verify the operation of a fundamental component of the 
architecture. This component is an active library resolution service that allows 
active programs to find and load active libraries from the network. This strategy is 
evaluated and reviewed in detail. The results indicate that the research prototype, 
which only investigates the issue of where and how to obtain arbitrary libraries 
from the network, does work and should be scalable. The proposed architecture 
relies on a novel conceptualization of what minimum functionality should be in an
extensible operating system and how systems could be built in an active network. 
The view is that extensible operating systems and mobile code allow functional 
components of the operating system and user applications to be obtained from the 
network. Thus, the overarching goal is a dynamic system in which system 
modules, such as file systems or network protocols, are loaded into and unloaded 
from the kernel on demand and the modules are obtained transparently from the 
network for active programs that need them.

1 INTRODUCTION

The evolution of the modern Internet is a process that has taken decades and has
resulted in a robust and flexible service to an ever-growing number of users. This 
rapid increase in usage and addition of service requirements is straining the 
Internet and forcing the development of a new protocol for the network, Internet 
Protocol version 6 (IPv6) (Lee, et al., 1998). As IPv6 development has also taken
a number of years and will continue to take a number of years, various “glue” 
technologies have been and are being used in the interim until there is a pure IPv6 
network. The ubiquitous World-Wide Web had been around a number of years 
before it became popular. These two cases illustrate an important problem in 
protocol standardization and deployment - it takes a significant amount of time to 
develop a protocol and to have significant wide-spread adoption of a protocol to 
make it truly useful. This problem is especially troublesome for protocols that 
operate at the network or transport layer; users must have a supported platform 
and/or must be able to get their service providers to support the service. For 
example, IP multicast service deployment has been resisted by many service 
providers. One proposed concept for reducing protocol development, deployment, 
and acceptance time is the active network (DARPA, 1998a and Tennenhouse, et
al., 1997). An active network is a network that is dynamically configurable and
allows for “rapid injection” of new protocols. This is done by not standardizing on 
how bits are transferred across the network, but rather on uniform computational 
models for protocol processing. 

This paper presents an architecture that supports the dynamic nature of an active 
network by allowing active programs to find and load arbitrary active libraries. It 
discusses the operation and implementation of a research prototype that was used 
to verify the core component of the architecture, the active library resolution 
service. The paper also presents the evaluation of this prototype which shows that 
it works and works well. 

2 ACTIVE NETWORK RESEARCH 

Tennenhouse, et al. (1997) provide an excellent summary of current active
network research. Important active network concepts used in this work are briefly
reviewed. In the modern network, a packet is the delivery unit that transfers data
from one point to another. The concept behind active networks is to make the 
packet “smarter” by inserting code as data. This code is executed at intermediate 
switching nodes and allows custom computations to be performed on arbitrary 
packets. Clearly, there are a number of problems that arise and these are discussed 
below in relation to the enabling technologies. 

A subtle change in how code-carrying packets are viewed can result in a powerful 
abstraction since, in this case, the packet is not a vehicle for carrying code but it is 
the code itself. This new vehicle is commonly called a capsule (Tennenhouse and
Wetherall, 1996). Capsules are self-contained programs that instruct
programmable switches on how to process themselves and may leave behind 
persistent code to help process other capsules. 

3 ENABLING AND RELATED TECHNOLOGIES 

Mobile agents (Nwana, 1996 and IBM, 1997), a type of intelligent agent, are very
similar in concept to an active network. Mobile agents are software that can be 
executed on arbitrary nodes in the network and can forward themselves through the 
network. The major difference between mobile agents and active networks is that 
mobile agents are at the application layer and active networks are generally at the 
network layer. 

Key enabling technologies for an active network include extensible operating 
systems (Engler, et al., 1995, Shapiro, et al., 1997, and Bershad, et al., 1995) and 
mobile code (Thom, 1997, Yemini and da Silva, 1996, and Wahbe, et al., 1993). 
Extensible operating systems take the micro-kernel approach one step further - 
they separate resource management from resource protection (Engler, et al., 1995). 
Thus, the extensible operating system can dynamically load different modules at 
run-time, such as memory managers or file systems. A run-time user-customizable 
kernel has a number of problems such as efficient inter-module communications 
and secure execution. One clearly does not want a poorly behaved module to 
terminate the entire operating system. 

Mobile code allows platform-independent, or portable, software to be developed. 
Mobile code technology, such as Java (Gosling, et al., 1996) and NetScript 
(Yemini and da Silva, 1996), comes in two variants, interpreted and dynamically 
compiled. Interpreted code is executed by software that performs the function 
described by the source code or scripting language. An alternative approach to 
interpretation is to run a machine-neutral bytecode in a virtual machine 
environment. Dynamically compiled code takes the portable source code and 
compiles it so that a native machine-code binary can be used. Clearly, dynamic 
compilation is faster than interpretation; however, there are numerous code-safety 
issues that must be resolved (Wahbe, et al., 1993 and Keppel, et al., 1991). Both
mobile code execution mechanisms should be available in a truly flexible active 
network environment. The Liquid Software project shows that it is possible to 
compile simple Java code to native code on a 200-MHz machine in the time it 
takes to receive the code on a 10 Mbps network connection (Hartman, et al., 1996). 

Extensible operating systems can be combined with mobile code to allow truly 
dynamic operating environments. This combination allows a vendor to write one 
application that runs on any hardware platform that incorporates the mobile code 
and extensible operating systems in a standard way. Obviously, much research and 
standardization remains to be performed in order to efficiently, reliably, and 
securely perform this task. This research focuses on one extension to, or
modification of, this model. If the application needs a module, such as a database
manager, that is not currently on the system then the application should be able to 
resolve it from the network. If the application was properly debugged and written, 
the module that is required must exist somewhere and there is a high probability 
that it can be found on the network. 
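
The idea of resolving a missing module from the network can be sketched as a lookup with a network fallback. This is a hypothetical illustration, not the prototype described in this paper: the class ModuleResolver, its methods, and the network_lookup callback (which stands in for whatever query mechanism the network provides) are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional


class ModuleResolver:
    """Hypothetical resolver: use an installed module, else ask the network."""

    def __init__(self, network_lookup: Callable[[str], Optional[bytes]]) -> None:
        self._installed: Dict[str, bytes] = {}
        # Stand-in for the network query; returns the code or None if not found.
        self._network_lookup = network_lookup

    def install(self, name: str, code: bytes) -> None:
        self._installed[name] = code

    def resolve(self, name: str) -> bytes:
        # 1. Prefer a module that is already present on the system.
        if name in self._installed:
            return self._installed[name]
        # 2. Otherwise try to obtain it from the network.
        code = self._network_lookup(name)
        if code is None:
            raise LookupError(f"module {name!r} not found locally or on the network")
        # 3. Keep the module so that later requests are served locally.
        self._installed[name] = code
        return code
```

An application would call resolve("database_manager") and receive the module whether or not it was shipped with the system; caching the result is a design choice of this sketch, not something the text prescribes.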

4 ACTIVE NETWORK OPERATING SYSTEMS 

The system model proposed by this work is a synthesis of the active network 
capsule concept, mobile agent operation, the extensible nature of research 
operating systems, and use of mobile code. The premise is that there is an 
operating environment, called the active network operating system (ANOS), that is 
run-time extensible and supports mobile code. It is divided into three privilege 
levels, the kernel space, the active handler space, and the user application space, as 
shown in Figure 1. Operating system “personality modules” are referred to as 
active handlers. The kernel space only provides basic resource protection and 
access functions, such as access to a file and prevention of multiple simultaneous 
writes. The active handler space provides the application services that users 
require, including file systems, memory managers, network protocol stacks, etc. 
Thus, different users can select different file systems according to what best suits 
their need. The minimally required modules are an active code resolution module 
and a network interface. If the resolution module is robust enough, all other 
modules can be obtained from the network. The user space is akin to the 
traditional user space. 




Figure 1 Block diagram of the ideal operating system at a node. 



A brief discussion of the active network operating system architecture is given 
below. Details of the proposed architecture can be found in (Lee, 1998). There are 
two types of active code, one that exists within the operating system itself and the 
other that exists in user space. The user space code can be viewed as a mix of 
traditional applications and mobile agents. The handler code can be viewed as a
mix of traditional extensible operating system modules and active protocols sent
across the network. The discussion focuses on handler code as opposed to user 
code. 

4.1 System Bootstrap 

Assume that a system is shipped with a minimal operating system of the core 
handler, a base protocol, and a resolution service. The base protocol provides 
network services for the resolution service and uses the bare memory protection 
and access features of the kernel. If a form of Dynamic Host Configuration 
Protocol (Droms, 1997) is embedded into the resolution service, then 
administrators at a site can ensure that all nodes are loaded in the same fashion. 

The only special modification that is required in the proposed resolution service is 
that a fixed sequence of resolution requests must be used. No other special 
protocol is required. Upon bootstrap, the node executes the initialization code 
which performs a set of resolution requests for a compiler, an interpreter, a 
memory manager, a file system, and so forth. Note that the interpreter and/or 
compiler must be able to support the native binary format of the machine in 
question and it may be preferable to include a base dynamic compiler in all active 
network operating systems. If so, then this compiler should allow itself to be 
replaced. Once the compiler is installed, all other modules can be made to run on 
the node in question. 
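
The bootstrap sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the module names in BOOT_SEQUENCE and the resolve and install callables are hypothetical stand-ins for the resolution service and the core handler.

```python
# Hypothetical sketch of the bootstrap sequence; all names are illustrative.
BOOT_SEQUENCE = ["compiler", "interpreter", "memory_manager", "file_system"]


def bootstrap(resolve, install):
    """Issue the fixed sequence of resolution requests and install each module.

    resolve(name) stands in for the resolution service and returns the
    module's code, or None if it cannot be found. install(name, code)
    stands in for the core handler loading the module.
    """
    installed = []
    for name in BOOT_SEQUENCE:
        code = resolve(name)
        if code is None:
            raise RuntimeError(f"bootstrap failed: could not resolve {name!r}")
        install(name, code)
        installed.append(name)
    return installed


# Usage with an in-memory dictionary standing in for the network.
network = {name: f"<code for {name}>" for name in BOOT_SEQUENCE}
node_modules = {}
order = bootstrap(network.get, node_modules.__setitem__)
```

Because the request order is fixed, every node that boots against the same resolution service ends up with the same set of modules.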

4.2 Active Handler Resolution

The precise installation process for active handlers is still an open area of research. 
There are numerous communications and code safety issues that need to be 
addressed; however, run-time insertion of dynamic code has been proven to be 
possible within an Active Bridge (Alexander et al., 1997a). A model of how the
resolution strategy operates from the network standpoint is provided later. The 
remainder of this section describes how code could be resolved and inserted into 
the operating system. The assumption is that there is a mechanism to query the 
network for active code, for the network to return information about the location 
and properties of active code, and for the network to return the code itself. 

Figure 2 shows a block diagram of the resolver handler. Clearly some mechanism
must be present to ensure that the code that is received is 1) authorized to be 
executed by the node, and 2) complete. Assuming that code is somehow received 
by the system, two levels of authorization checking are performed by the proposed 
system. The gatekeeper makes a simple high-level check to verify that the code 
format is supported and the capsule was received intact. The security service 
provides more detailed authentication mechanisms before the code is either directly 
loaded into the handler space, compiled into code that can be loaded into the 
handler space, or interpreted within the handler space. Code that is loaded into the 
handler space runs using special guard code (Wahbe et al., 1993) to ensure that
handler failure does not cause catastrophic failure for the operating system and to
ensure that the handler does not try to override operating system protections.
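
The two levels of checking can be illustrated with a short sketch. Everything here is an assumption made for illustration: the Capsule fields, the use of a SHA-256 digest for the intactness check, the list of trusted signers, and the mapping from code format to installation path are not taken from the proposed system.

```python
from dataclasses import dataclass
import hashlib

SUPPORTED_FORMATS = {"native", "bytecode", "source"}  # assumed format names


@dataclass
class Capsule:
    code_format: str
    payload: bytes
    digest: str   # SHA-256 hex digest carried with the capsule (assumption)
    signer: str   # identity claimed by the sender (assumption)


def gatekeeper_ok(capsule):
    """High-level check: supported format and payload received intact."""
    intact = hashlib.sha256(capsule.payload).hexdigest() == capsule.digest
    return capsule.code_format in SUPPORTED_FORMATS and intact


def security_ok(capsule, trusted_signers):
    """Placeholder for the more detailed authentication step."""
    return capsule.signer in trusted_signers


def admit(capsule, trusted_signers):
    """Return how the code would enter handler space, or None if rejected."""
    if not gatekeeper_ok(capsule) or not security_ok(capsule, trusted_signers):
        return None
    paths = {"native": "load", "source": "compile", "bytecode": "interpret"}
    return paths[capsule.code_format]
```

The cheap gatekeeper test runs first so that malformed or damaged capsules are discarded before any authentication work is done.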







Figure 2 Block diagram of the resolver handler. 



Code delivered from user space, within user space, or from the network may not be
complete. For example, a user application may be run that requires a newer
windowing system library than is present on the system. Without that library, the 
application will not run; however, if the code is active in nature then it can make a 
call to the resolution service which will search the network for the appropriate 
library. Once that library is found, it can be installed and the application can be 
run - all without human intervention. Two mechanisms can be used to ensure that 
the code is complete. The first mechanism is to embed, in special headers 
transferred along with the code, all the required libraries. The gatekeeper can use 
this information to verify that required libraries are present and if not present, can 
either obtain them for the active code or reject the active code. The second 
mechanism is the on-demand resolution and loading of active libraries. This can 
be performed by the guard code that is inserted into the active code or by the
interpreter. The first mechanism allows for efficient, non-stop execution of the
active code and the second mechanism ensures that the system is robust.
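
The first mechanism, in which the gatekeeper checks the libraries named in the capsule headers, might look like the sketch below. The function name, its parameters, and the resolve callable are hypothetical; the sketch only shows the accept, fetch, or reject decision described in the text.

```python
def check_complete(required, installed, resolve, allow_fetch=True):
    """Gatekeeper-style completeness check (illustrative only).

    required    -- library names listed in the capsule headers
    installed   -- dict of libraries already on the node (updated on fetch)
    resolve     -- stand-in for the resolution service; returns code or None
    allow_fetch -- if False, missing libraries cause immediate rejection
    Returns True if every required library is present afterwards.
    """
    for lib in required:
        if lib in installed:
            continue
        code = resolve(lib) if allow_fetch else None
        if code is None:
            return False           # reject the active code
        installed[lib] = code      # obtained on behalf of the active code
    return True
```

The on-demand alternative would defer the same lookup until the guard code or interpreter first reaches a call into the missing library.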

5 ACTIVE CODE TRANSPORT AND RESOLUTION 

The remainder of the paper discusses how code can be transported to and from a 
node and a proposed strategy for resolving libraries from the network. There are 
two methods that can be used to deliver active code: retrofitting existing protocols
or creating a new internet protocol. Clearly, the latter is not a preferred solution if
the only driving reason to create a new internet protocol is to support active code.
Thus, retrofitting IP is the popular solution (Wetherall and Tennenhouse, 1996;
Alexander et al., 1997b). The solution entails creating a new IP header option, the
Active IP option (Wetherall and Tennenhouse, 1996), that can be used to inform a
node that the code in the packet is a capsule. Nodes that are aware of the active 
network can process the code and nodes that are not aware of the active code will 
simply ignore the code. These retrofitting proposals provide a way to identify the 
program encoding that is used by the active code and authentication information 
about the active code. This is insufficient as other information will be required for 
a robust active network; however, the design of the Active IP option does not 
preclude this as it uses an encoding similar to Multipurpose Internet Mail 
Extensions (MIME) headers (Borenstein and Freed, 1996). Thus, if information
such as distribution restrictions, copyright and usage cost information, revision
history, required libraries, and so forth needs to be attached to active code, it can be
easily done.
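
To illustrate why a MIME-like encoding makes such extensions easy, the sketch below parses a capsule description with Python's standard email package. The content type and every X- header name are invented for this example and are not part of the Active IP option.

```python
from email import message_from_string

# Hypothetical capsule description; header names are illustrative only.
raw = (
    "Content-Type: application/x-active-code\n"
    "X-Code-Name: ping\n"
    "X-Code-Version: 1.2\n"
    "X-Required-Libraries: checksum, timer\n"
    "\n"
    "...payload...\n"
)

msg = message_from_string(raw)
content_type = msg.get_content_type()
required = [item.strip() for item in msg["X-Required-Libraries"].split(",")]
```

A node that does not understand a given header can simply ignore it, which is what allows new attributes to be added without changing the option format.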

The retrofitting proposals provide a simple delivery mechanism that does not
easily allow active code to be retrieved from another node. One can make the 
argument that a capsule can be sent to the other node which can then retrieve the 
active code. If so, this would be a standard operation and this capsule can be 
viewed as part of a standard set of capsules that are available on any active node. 
The retrieval operation is an important requirement for a robust active library 
resolution strategy. The active library resolution strategy should be able to request 
code to be transferred or, upon receipt of a query and finding the library locally, a 
node should be able to send the code to the requester.

Now that the basic elements of the operating environment and code transfer 
mechanisms have been developed, the core service of active library resolution is 
discussed. Active libraries are active code that is used by other active programs.
The libraries can be stand-alone programs or other libraries. 

Assume the active code arrives at switch C and passes to switch B and then to host 
A and also assume that an active library is required, as shown in Figure 3. Ideally, 
host A would request the library first from switch B and then from switch C 
(Tennenhouse and Wetherall, 1996). However, it is possible that switches B and C
may either fail or delete required libraries for whatever reason, and host A will not
be able to execute the active code. If the active code is popular, then clearly the
active library will be located somewhere on the network; especially considering 
that users within a group tend to use the same applications (Alexander et al.,
1997a). To handle the situation that the active library may be found along the 
return path of the code or is available locally, the proposed service relies on the 
expanding-ring multicast search (Deering, 1991). This mechanism also allows 
queries to arbitrary and unknown nodes on the network. The expanding-ring 
multicast search first makes a query to all nodes that are on the same network as 
the requester, labeled search 1 in Figure 3. If a library is not found on the local 
network, then the search is repeated so that adjacent networks are included, labeled 
search 2. Again, if the library is not found, the search range is increased again. If 
proxy servers are used, then this model can be expanded to handle private networks 
and potentially provide query translation from different protocols. In addition, 
caching mechanisms can be used so that special active library servers can be 
present in the network. 
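
The expanding-ring behaviour can be illustrated on a small simulated topology. This sketch is an assumption-laden stand-in: a real deployment would scope multicast queries by hop count, whereas here the rings are computed from breadth-first hop distances over a hypothetical adjacency dictionary.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first hop counts from source over an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def expanding_ring_search(graph, source, holders, max_radius):
    """Return (radius, responders) for the first ring that finds the library.

    Each pass repeats the query with a larger scope, mimicking a multicast
    query whose range is increased after every unsuccessful search.
    """
    dist = hop_distances(graph, source)
    for radius in range(1, max_radius + 1):
        found = sorted(
            n for n, d in dist.items() if 0 < d <= radius and n in holders
        )
        if found:
            return radius, found
    return None, []


# Hypothetical topology: A is the requester.
graph = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["D"],
}
```

Nearby copies of a library are found with small rings, so distant nodes are queried only when closer ones cannot answer.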







Figure 3 Active library resolution operational model. 



6 PROTOTYPE IMPLEMENTATION AND EVALUATION 

This research developed an experimental protocol to perform active library 
resolution and a related protocol to transport the active code. The transport 
protocol was developed to show the features that are required of a transport 
protocol for an active network. A prototype implementation was created to verify 
that the resolution and code execution strategy works. As it is impossible to 
measure the performance of a global service without wide-scale deployment and 
the latter is clearly not possible, a separate simulation system was built to verify
that the performance of the system is acceptable and scalable. The simulation
system was validated against test runs of the small-scale prototype. The remainder
of this section discusses key features of the transport protocol and the library 
resolution protocol and presents the experimental results. 

6.1 Active Code Transport and Resolution 

As discussed earlier, active code transport can be performed by retrofitting existing 
protocols and should include MIME headers. MIME headers are text-based and 
are easily extensible. This work developed a test protocol to determine what type 
of headers are required of an active network and what type of protocol, or capsule, 
support is needed. A wide range of headers was investigated, covering the naming of
the active code, application programming interfaces (APIs), authoring and copyright
information, distribution restrictions, usage cost, and payment. The proposed
use of MIME headers also introduces a dictionary-based compression scheme to
reduce the overhead consumed by text headers.

The headers also provide the mechanism to search for a library. Many of the 
headers can be used as search targets for a query. The searches are performed by 
making regular expression comparisons and simple numerical tests, as appropriate. 
If an active program requires an active library, the name of the active code will
typically be used as the search target. Search constraints include verifying that the 
active code in question has the correct API and, perhaps, version or other 
information. The search targets and constraints are treated as a logical AND of all 
previous operations. 
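
A query of this form can be sketched as a set of regular-expression targets and numerical constraints that must all hold. The header names and the function below are hypothetical and only illustrate the logical AND of the individual tests.

```python
import re


def matches(headers, targets, constraints):
    """True only if every target pattern and every numeric test passes.

    headers     -- dict of header name -> string value for one candidate
    targets     -- dict of header name -> regular expression
    constraints -- list of (header name, predicate on float) pairs
    """
    for name, pattern in targets.items():
        if name not in headers or re.fullmatch(pattern, headers[name]) is None:
            return False
    for name, test in constraints:
        try:
            if not test(float(headers[name])):
                return False
        except (KeyError, ValueError):
            return False
    return True


# Hypothetical candidate library description.
candidate = {"name": "checksum", "api": "cksum(buf, len)", "version": "2.1"}
```

For example, a search for any library whose name begins with "check" and whose version is at least 2.0 would pass `{"name": r"check.*"}` as the targets and `[("version", lambda v: v >= 2.0)]` as the constraints.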

Clearly, in a dynamic Internet of the future, code will be treated as a commodity 
and it is unclear what the economic model will be. Thus, this header investigation 
must be considered preliminary but the results are that any active code transport 
mechanism should support active code delivery and retrieval and provide a 
mechanism to allow capsules at a source node to determine if the remote node 
properly processed the capsule. As these are common requirements, standard 
support must be provided. This research implemented a simple protocol called the 
Active Transport Protocol (ATP) to test these ideas and to provide support for the 
resolution service. 

As noted earlier, the resolution strategy relies on multicast to perform the query. 
Considering that a set of MIME headers is used as search targets and constraints, it is
clear that the query will not fit into a single packet. As the API information is 
expected to be the major component of the search, twenty Linux header files were 
investigated to determine the likely typical size of an API search. 

As shown in Tables 1 and 2, the average function name size is 11 bytes and the
average number of parameters per function is three. The average number of
composite data types per function is six. This number is obtained by dividing the
composite type count of 115 by the 20 header files. The average length of a
composite data type name is ten bytes and the average number of variables used by
a composite data type is four. About 16 functions and 12 composite data types 
exist in an average header file. Other analyses, reported in (Lee, 1998), confirmed 
that these numbers are reasonable. Because function parameters typically consist 
of well-known structures and names, such as the C int or float primitive data 
types, a high level of compression can be achieved. Based on this analysis and the
typical maximum transmission unit of 1,500 bytes, it is likely that most queries will be
one to three packets in length. 



Table 1 Header File Analysis Results, Part I

                                      Function Name Size (Bytes)
                   Count    Total     Average    Low    High
Function Count       310     3409        11.0      2      30
Composite Types      115     1065         9.3      2      16
Totals               425     4474        10.5      2      30



Table 2 Header File Analysis Results, Part II

                                      Number of Parameters
                   Total (Bytes)      Average    Low    High
Function Count               675          2.1      0       9
Composite Types              407          3.5      1      26
Totals                      1082          2.6      0      26



Considering the small number of packets required, the Active Library Resolution 
Protocol (ALRP) was designed to support queries of up to 64 packets, which
allows an estimated 590 uncompressed search constraints to be included in 
576-byte minimum sized IP packets. If a library requires more than 590 search 
constraints, it probably is not well written. 

ALRP uses no error correction, but relies on the fact that in the expanding-ring 
search the packets are naturally retransmitted as the ring is expanded. This 
significantly reduces the complexity of the protocol and the traffic required to 
transmit error correction information.

6.2 Simulation Evaluation

The prototype system verified that active libraries can be resolved from the 
network and installed and executed dynamically. A modified ping (Regents, 1993) 
program was delivered from a source node to a destination node. This ping 
program required a simple checksum library and this requirement was embedded in 
the headers. The library was resolved from the network, installed, and the ping 
program successfully executed. 



Table 3 Test Network Characteristics

        Group Membership                         Number of
Case    Size                Level    Distance    Leaf Networks
1       10                  0        2           2
2                           0        2           2
3                           1        3           10
4                           2        5           100
5                           4        9           1000



Since system extensibility can be achieved through the addition of MIME headers, 
the other primary criterion for a global system is scalability. As noted earlier, the
system must be simulated to evaluate performance. The simulator that was created 
used random hierarchical networks at various fixed levels of hierarchy. 

Hierarchical networks closely match the topology of a multicast tree (Deering, 
1991). A number of different cases were used to show scalability, as indicated in 
Table 3. The number of nodes in the network and the levels of the network were 
increased for each case. The system measured the transmission time and number 
of packets sent and also allowed for different link error rates to be set. Two
end-to-end error rates, one and five percent, were calculated and used. One test
case used error rates from one to 75 percent. Because of the lack
of error correction, the important factor for scalability was the distance away from 
the source. Scalability was measured in terms of linear resolution time and packet 
counts as the distance from requesting client to the designated server increased. 

The simulation did not account for the effects of multicast routing, caching, 
different request sizes and implementations, and similar considerations. The 
simulation does consider networks of different sizes, different loss rates, variations 
in server and source locations, and variations in key factors. Bandwidth for all
links is assumed to be a uniform 10 Mbps. Delay and loss rates are assumed to be
uniform across all links. The simulator was verified against the prototype for case
1.







Figure 4 ALRP loss performance analysis for resolution time. 



Figure 4 shows the analysis of loss for case 4. As the request size and the error 
rate are increased, the resolution time shows a proportional increase. Considering 
the worst case of 64 packets in a query, there was only a four times increase in the 
resolution time and a 3.3 times increase in the number of packets transmitted for 
the 75 percent loss rate compared to the situation with a one percent loss rate. For 
the typical case of one to three packets, there is a negligible difference in 
performance. 

Figure 5 shows the analysis of scalability for the one percent and five percent
error rates. This plot gives the resolution time as a function of the number of
links and the results presented are an average of the normalized data set. Each data
set was normalized to the case where the request size was one packet. The average
of the four cases is plotted. For each loss rate, a linear “best-fit” curve is
generated and its equation and R² value are presented. The closer the R² value is to
unity, the more accurate the linear regression. Figure 5 shows a linear correlation
for both the one percent and five percent resolution time cases. The slope for both 
curves is significantly less than one. Also, both equations are almost the same. 
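
For readers who wish to reproduce this kind of fit, an ordinary least-squares line and its R² value can be computed as sketched below. The data points are made up for illustration and are not the measurements behind Figure 5.

```python
def linear_fit(xs, ys):
    """Least-squares slope, intercept, and R^2 for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up points for illustration; NOT the paper's measurements.
links = [2, 3, 5, 9]
times = [1.0, 1.2, 1.6, 2.4]
slope, intercept, r_squared = linear_fit(links, times)
```

A slope well below one together with an R² close to unity is the pattern the text describes: resolution time grows linearly but slowly with distance.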

This indicates that resolution time scales well, regardless of error rates, and that
results for larger networks can be determined by the two similar equations.

Figure 5 ALRP resolution time scalability.



For transmitted packet counts (not shown), the one percent error case has a slope of 
1.3 and the five percent error case has a slope of 19. This indicates that low loss 
rates scale significantly better than higher loss rates with respect to the transmitted 
packet counts. Thus, the impact on the network is insignificant for the one percent 
case and moderate for the five percent case for moderately sized to large networks. 
From the above analysis, indications are that unrealistically huge networks will 
have a large number of retransmitted packets at five percent loss. Source to 
destination distances in modern internets generally are not much more than 15
hops. The significant difference in the two transmitted packet count equations 
indicates that higher error rates will lead to significantly more packets being 
retransmitted. Excessive numbers of packets will begin to degrade resolution time 
performance as network capacity is exceeded. Caching systems should reduce the 
average distance to something much less than 15 hops. 

7 CONCLUSIONS 

Active networks are an emerging area of research that relies on other emerging 
areas of research, most notably concepts from extensible operating systems and
mobile code. This work proposes a unique synthesis of these two areas by 
advocating a generalized paradigm of dynamic networks and nodal processing. A 
node consists of a uniform set of kernel interfaces and a service to dynamically 
obtain other system components from the network. A specific approach to the 
resolution service was discussed and an evaluation of this approach presented. The 
use of a flexible transport strategy combined with MIME headers allows for a 
powerful model of code resolution. Code is treated as an object and the entire 
network is treated as a gigantic database. The scalability of the service is controlled
by using expanding-ring multicast searches and caching systems. The
experimental prototype shows that this approach can work and the simulation
results indicate that it can work well on a large scale. 

Obviously, there is much research that must be performed before this model, or a 
variant of it, can be realized. There are significant issues in communications 
overhead, practical and scalable internal system architectures, standard access 
models, efficient and safe code execution, and unknown distribution and cost 
models. This does not even begin to cover the other challenges in active networks 
(DARPA, 1998b), such as network management and security.

Abstract

With the progression of multimedia middleware and guaranteed network services, 
developers are now presented with flexible frameworks for the development and 
deployment of distributed multimedia applications. New and more advanced 
applications are supporting end-to-end Quality of Service (QoS) guarantees through 
the configuration and management of distributed resources. As an effect of the 
sharing of network and end-system resources across multiple clients, coupled with 
their dynamically changing state, the general end-to-end availability of resources in 
a distributed environment is variable and potentially unpredictable. Thus, the 
provision of QoS constrained services in a distributed environment demands 
carefully controlled and co-ordinated management mechanisms. In this paper we 
discuss the requirements for QoS adaptation mechanisms and QoS-based 
distributed resource management, together with our approaches to QoS adaptation 
and policing, with issues concerning the incorporation of these mechanisms in our 
recently developed Distributed Resource Management Architecture (DRMA). 

Keywords 

End-to-end QoS, Resource Management, QoS Adaptation 



1 INTRODUCTION 



The growth of general networked computing performance is being closely followed 
by the exploitation of this capability by QoS constrained applications. Many of
these applications, such as video conferencing and distributed collaborative
environments, have dynamically changing and potentially unpredictable QoS 
requirements. This problem is exacerbated by the heterogeneous nature and varying 
capabilities of today’s end-systems and global network infrastructures. As a result, 
conventional resource reservation and admission techniques cannot guarantee QoS 
without considerable over-booking and inefficient resource utilisation, something 
which is particularly undesirable in systems that are required to maintain high 
levels of resource sharing. 

To address this problem, and avoid any adverse impact to the end-user, 
distributed applications and their infrastructures need to become adaptive. This 
means that either applications must tolerate fluctuations in resource availability or 
that the supporting infrastructure can itself mould, through distributed QoS 
management, to the dynamically changing requirements of the applications. 
Furthermore, certain applications, especially distributed multimedia applications 
with relatively demanding QoS requirements, simply cannot operate outside strict 
resource requirements, and therefore the need for QoS management arises. For 
example, Video on Demand (VoD) applications often incur varying resource 
requirements through changing counts of ‘viewers’ with different end-system 
capabilities (set-top boxes to high performance workstations). Furthermore, 
because of the heterogeneous nature of network connections and end-system 
resources, offering better guarantees and higher performance is often an infeasible
and uneconomic solution. If QoS is not carefully managed, and resource utilisation
scrupulously co-ordinated, then the desired levels of service across the application are
difficult to maintain. Other effects of poor QoS management include loss of 
synchronisation through delayed updates, deadlocks in shared resources, and failure 
of mission-critical applications through resource unavailability. The term ‘QoS 
management’ encompasses the maintenance of a required level of service through 
the co-ordinated configuration and control of end-to-end resources. Because of the 
obvious proliferation and acceptance of distributed object computing, and more 
importantly its usefulness in addressing the problem of QoS management in an 
open distributed system, proposals have already been made for object-based 
management frameworks (Dang, 1995; ISO, 1997; Waddington, 1997). However,
until now, much of the work has only addressed issues of QoS specification and
the projection of useful QoS abstractions and services to the application 
programmer. 

In this paper we discuss our approach to distributed QoS adaptation, 
furthering Lancaster’s original work on QoS maintenance and adaptation in the 
QoS-A framework (Campbell, 1994). We now place more emphasis on the 
provision of QoS guarantees in distributed processing environments, whilst also 
addressing the problems of QoS-driven resource management and adaptation in the 
middleware infrastructure. In section 2 we discuss the general requirements for 
distributed resource management and the implications of supporting QoS 
constrained applications. Then, in section 3, we introduce the rationale for adaptive
QoS-based resource management, together with support for resource monitoring
and QoS degradation prediction, and its use in instigating adaptation processes. In
section 4 we discuss some of the issues concerning the engineering and 
implementation of QoS adaptation into distributed multimedia middleware 
platforms, and in particular our recently developed Distributed Resource 
Management Architecture (DRMA) (Waddington, 1997). Finally in section 5 we 
present our conclusions. 

2 QOS PROVISIONING THROUGH RESOURCE MANAGEMENT 

In order to sustain the QoS requirements of a given continuous media application, 
all resources involved in the handling and processing of data from end-to-end, must 
be carefully co-ordinated and managed. The end-to-end QoS is some function of 
the resource utilisation of a distributed application, from client through network to 
server. 





Figure 1 Resource Capacity Regions for different levels of QoS 



Any application which needs to provide a QoS-bound service must maintain its 
distributed resource utilisation within a finite space, which we call the resource
capacity region. Figure 1 illustrates this relationship between resource usage in the
end-systems and the network. The region’s surface represents the balance of
resources required to sustain a particular end-to-end QoS; hence each level of
service has a different capacity region associated with it. For each level of service,
provided that an application maintains its resource utilisation within the defined 
region, QoS is sustained (note that each capacity region associated with a particular 
level of QoS is represented by its own graph). 

The concept of capacity regions was proposed in Columbia University’s work
on meeting end-to-end QoS guarantees over high performance packet-switched
networks. Hyman (1995) introduces the concept of a schedulable region, which
describes a finite space representing the number of calls of a given class a particular
network link can support. This concept was later extended (Lazar, 1995) to
incorporate capacity regions for end system resources, known as multimedia
capacity regions, where the summation of the two capacities is used to determine
end-to-end call admission. 

If an application is unable to maintain its utilisation within the region, possibly 
due to the unavailability of resources, then degradation in the application QoS will 
occur. On the other hand, if an application’s resource utilisation is not close to the 
surface of the capacity region, then resources are not being used optimally which 
may result in the degradation of QoS for other applications sharing the resources. 
Thus, distributed applications providing end-to-end QoS should strive to maintain 
resource utilisation as close as possible to the balance defined by the surface of the 
resource capacity region. So far, our discussion of the resource capacity regions 
has been limited to distributed applications based upon single client/network/server 
scenarios. However, the concept can be readily extended to multi-point 
communication scenarios, resulting in additional dimensions to the capacity graph 
(we are not limited by 3-dimensional space, as capacity regions are actually 
realised as non-visual multi-variable relationships).

QoS adaptation is the process of maintenance control, facilitated through
alterations to either the balance and distribution of resources or to the application’s
level of service, on short time scales. Adaptation processes often occur as a result
of QoS notifications, usually emitted from QoS monitoring mechanisms, which
indicate a change in the observed service effected through the availability of some
element of the end-to-end resources. Notifications may indicate an imminent lack of
resources and hence reduction in service quality (QoS degradation) or a failure to
maintain service quality through a complete loss of resources (QoS failure).
Whether degradation or actual failure occurs, QoS adaptation is required to either 
adjust the balance of resources to maintain graceful degradation, or recover service 
quality, or alternatively inform the end-user of the need to alter to a new level of 
service (change in level of service could entail dropping one or more media 
channels, or reducing information resolution). Ideally, adjustments in resource
balance, effected by the adaptation mechanisms, will satisfy the function of
resource utilisation described by the surface of the resource capacity region.
Doing so is likely to involve one or more entities of the resource set (client,
server or network) increasing their own resource utilisation to counteract 
deficiencies in the failing entity. 

It is also important to realise that the rules of adaptation that we apply to 
simple client-network-server applications can be readily scaled to applications 
which communicate in one-to-many and many-to-many relationships. Figure 2 
presents some example adaptation processes in a one-to-many environment. Figure 
2(a) illustrates a loss in the network resources (for all connections) which is 
counteracted by an increase in client/server end-system resource utilisation 
(perhaps through the incorporation of compression techniques). In examining 
adaptation processes in multi-party relationships there are additional factors which 
must be considered, for instance the number of remote end-systems a particular 
server is communicating with. The example scenario described in figure 2(b) 
shows a drop in resources for a single connection, whilst other network resources 
remain stable. In this scenario it does not make sense to increase utilisation of the 
server and thus, of all the clients, in response to a single client failure. Therefore, 
an alternative action would be required, such as increasing the resource utilisation 
for the individual client (maybe by employing data reconstruction or forward error 
control techniques). If a particular set of resources fail and counteractions cannot 
be made, then a reduction in QoS is inevitable (see figure 2(c)). In such a scenario,
the adaptation process is required to release other shared resources or inform the 
end-user of a need to change the level of service. Finally, figure 2(d) illustrates that 
in order to effect an increase in end-to-end QoS then an increase in all of the
end-to-end resources is required.
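
The decision logic behind the degradation scenarios of figure 2(a) to 2(c) can be summarised in a short sketch. The function, its parameters, and the action labels are hypothetical; they restate the cases described above rather than any mechanism of a particular platform.

```python
def choose_adaptation(degraded, all_clients, can_compensate):
    """Pick a coarse adaptation action for a one-to-many session.

    degraded       -- set of clients whose network resources have dropped
    all_clients    -- set of every client attached to the server
    can_compensate -- True if spare end-system resources exist to counteract
    """
    if not degraded:
        return "no-action"
    if not can_compensate:
        # Figure 2(c): release shared resources or ask for a new service level.
        return "reduce-level-of-service"
    if degraded == all_clients:
        # Figure 2(a): loss on all connections, e.g. counter with compression.
        return "increase-server-and-client-utilisation"
    # Figure 2(b): only some clients affected, e.g. counter with error control.
    return "increase-utilisation-of-affected-clients"
```

Treating the single-client case separately avoids raising server utilisation, and therefore the load on every other client, in response to a purely local problem.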

2.1 Hierarchical QoS Management

From an engineering perspective, QoS management in a distributed system is a 
substantially complex task. An approach which has been proposed in the 
Distributed Resource Management Architecture (DRMA) (Waddington, 1997) is 
hierarchical QoS management. This technique breaks down the task of managing 
end-to-end resources by dividing the problem into a set of finer-grained
point-to-point requirements which are structured as hierarchical bindings (see figure 3). By
doing so, mapping and monitoring processes become distributed from end-to-end, 
enhancing scalability and avoiding problems of centralised control. The 
hierarchical management approach is also suited to the engineering of adaptation 
mechanisms. At the upper levels of the structure, adaptation mechanisms are 
responsible for coarse-grained actions, including reconfiguration and change of 
service. Descending towards the leaves of the hierarchy, adaptation mechanisms 
become finer-grained, usually in the form of atomic resource control. The 
processes responsible for the adaptation actions are maintained within binding 
components (bindings are simply communication abstractions between one or more 
potentially distributed object interfaces). Furthermore, each individual binding is 
responsible for maintaining, through monitoring and adaptation, the point-to-point 
QoS characteristics which are defined by its interfaces. 

Figure 3 Hierarchical QoS Adaptation 

2.2 Core Distributed Resources 

The term ‘resource’ is inherently vague. Resources can exist at varying levels of 
abstraction. For instance at the highest level the term could include an end-system 
or a network node. At a lower level, a resource could represent a physical device or 
some system wide shared resource such as a network connection or access to a 
physical disk. However, we believe that there is a finite set of resources, in the end- 
system and the network, in terms of which all other resources can be described. 
Therefore we propose that, at least to begin with, we should concentrate on 
monitoring a limited set of resources and make adaptations accordingly. 



Table 1 Core Distributed Resources 

Area         Resource              Description
-----------  --------------------  --------------------------------
End-System   CPU                   Processor cycles
             Physical Disk Access  Disk I/O requests
             Memory                Paged and non-paged memory usage
             Cache                 File and I/O caching
             Auxiliary Memory      Device buffers, video memory
             I/O Devices           Serial/parallel ports
             Peripherals           Video camera, microphone, etc.
             NSAPs                 End-system access ports
Network      Bandwidth             Traffic throughput
             Buffer Space          Router buffer requirements
             Switch/Router Ports   Network channels/paths

The core resources, as described in the above table, are shared by applications 
across the distributed environment. This finite set covers the majority of resources 
which are shared in a distributed system. By careful management and control of 
these, we can begin to prioritise resource utilisation and hence offer QoS 
guarantees. 

2.3 Resource Scheduling 

In order to successfully share resources across distributed applications and, 
furthermore, offer sufficient guarantees on their availability (an obvious 
requirement for time critical continuous media applications), resources need to be 
scheduled. Scheduling is the process of determining the availability of resources at 
a particular instant in time. Through resource reservation, an application can 
request resources and in return the system can determine whether sufficient 
resources are available to service the given request. 

Scheduling algorithms can only be effective if they are used in advance of the 
admission of the resource. However, the time scale between resource admission 
testing and admission is dependent upon the scheduling algorithm. Many 
algorithms, of varying complexity and usually focused at either the network or the 
end-system, have been suggested (Hyman, 1995; Anderson, 1993). However, in 
scheduling resources for multimedia applications the use of exhaustive or statistical 
techniques is often inappropriate (applications such as VoD can be statically 
analysed, but dynamically changing applications such as video conferencing 
cannot). It is suggested that simple heuristic scheduling techniques are preferable, 
offering a low processing overhead, and thus being more suited to resource 
requirements which vary in real-time. A basic model for resource reservation is 
offered by (Wolf, 1995). In this model, resources are requested by the application 
to the individual resource manager (i.e. network or end-system). The resource 
request describes the resource requirements of the application and the duration of 
the requirement. Provided that there are sufficient available resources, the resource 
manager returns a confirmation; otherwise the request is refused and a failure is 
returned. On successful completion of the negotiation phase, the client contacts the 
resource manager at the point when previously reserved resources are required. 
The demand is acknowledged, and the client can then use the resource for the 
duration of the reservation. This technique of reservation in advance does require 
that the duration of the reservation can be calculated a priori and that the resource 
usage is scheduled from the reservation request. There are techniques, such as 
partitioning, which can be used to couple with non-advance resource reservation 
systems; however we feel that the reservation in advance scheme is suitable for our 
purposes. 
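
The reservation-in-advance exchange described above (request, admission test, confirmation or refusal, later claim) can be sketched as follows. This is a minimal illustration under stated assumptions, not the model of (Wolf, 1995) or the DRMA implementation: the class ResourceManager, its methods and the single scalar capacity are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Hypothetical manager for one resource with a fixed scalar capacity."""
    capacity: float
    reservations: dict = field(default_factory=dict)  # id -> (start, end, amount)
    next_id: int = 0

    def _peak_usage(self, start, end):
        # Reserved usage only rises at reservation start times, so it is enough
        # to check the window start and every start time inside the window.
        points = [start] + [s for s, _, _ in self.reservations.values()
                            if start < s < end]
        return max(
            sum(a for s, e, a in self.reservations.values() if s <= t < e)
            for t in points
        )

    def reserve(self, start, duration, amount):
        """Admission test: return a reservation id, or None if refused."""
        end = start + duration
        if self._peak_usage(start, end) + amount > self.capacity:
            return None
        self.next_id += 1
        self.reservations[self.next_id] = (start, end, amount)
        return self.next_id

    def claim(self, res_id, now):
        """The client demands its reserved resources at time `now`."""
        start, end, _ = self.reservations.get(res_id, (None, None, None))
        return start is not None and start <= now < end
```

Note that the duration must be supplied with the request, which is exactly the a priori knowledge that the reservation-in-advance scheme relies upon.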

3 ADAPTIVE RESOURCE MANAGEMENT 

Many of today’s applications continuously adjust their resource requirements. This 
dynamic nature is particularly evident in distributed multimedia applications which 
demand strict levels of QoS. Furthermore, the problem is intensified in general 
purpose distributed environments because resources, in the network and the end- 
system, must be shared across multiple contending applications. Some operating 
systems, especially real-time systems, employ resource scheduling techniques (as 
previously discussed) in an attempt to alleviate the problem. Reservation and 
admission do allow a system to offer firm guarantees, provided that the resource 
itself can be guaranteed (wireless communication bandwidth, for example, cannot 
always be). However, many applications, such as distributed games, have varying 
resource requirements which cannot be predicted. One solution is to over-book 
resources and hope that the application does not demand resources beyond those 
made in the reservation. This is not an ideal solution as it is likely to result in the 
inefficient use of resources. So what is the rationale behind the use of resource 
reservation and admission techniques in a general purpose operating system at all? 
Why not simply rely directly upon monitoring and adaptation? In response, we 
suggest that, compared with adaptation processes, resource reservation and 
admission processes are relatively lightweight and demand little of 
the system. Furthermore, the majority of applications know their resource 
requirements a priori, and are therefore suited to a model of admission and 
reservation. 

Coupled with the problem of varying client requirements is the indeterminate 
nature of end-to-end resource availability. This difficulty is particularly evident 
within wide-area QoS guaranteed networks, which create an end-to-end connection 
through the concatenation of multiple hops. Because each hop in the connection 
continuously changes state, QoS routing techniques must be employed in 
determining a suitable end-to-end route. As a result of the inability to determine 
the exact state of end-to-end resources, we propose the use of two-tiered ‘loose’ 
reservation and admission in order to maintain a best-effort management of 
resources. Loose resource reservation implies that approximate state metrics are 
used in admission and reservation, and that only statistical guarantees can be made. 
In the event that an application’s resource requirements do change, then adaptation 
techniques are used to maintain QoS. Thus, the resulting system combines the 
benefits of resource reservation and admission, with the flexibility of monitoring 
and adaptation. 
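
One possible reading of 'loose' admission is sketched below: the admission test works from an approximate estimate of current usage rather than exact state, so it can only offer a statistical guarantee. The estimator (mean plus k standard deviations of recent samples) and the function name loose_admit are assumptions made for illustration; the text does not prescribe a particular estimator.

```python
import statistics


def loose_admit(recent_usage, request, capacity, k=2.0):
    """Admit `request` if the estimated usage plus the request fits the capacity.

    `recent_usage` holds approximate samples of current utilisation; `k`
    controls how conservative the estimate is (hypothetical parameter).
    """
    estimate = statistics.mean(recent_usage) + k * statistics.pstdev(recent_usage)
    return estimate + request <= capacity
```

A larger k makes admission more conservative; with k = 0 the test degenerates to admission against the mean observed usage.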

3.1 Monitoring and Prediction 

Monitoring is the process of observing the utilisation of resources and/or QoS 
characteristics in the system. It is the responsibility of the monitoring process to 
observe events and provide messages indicating the occurrence of QoS contract 
violations. There are two approaches to monitoring: intrusive and non-intrusive. 
Intrusive monitoring means that the monitoring process takes periodic samples of 
resource availability and utilisation; this in turn means that resources are consumed 
by the monitoring process itself, an overhead which is sometimes unacceptable. 
Alternatively, non-intrusive monitoring processes rely upon indications of events which are 
not within the bounds of an agreed QoS contract. For instance, a monitoring 
process may receive indications from a media decoder concerning the dropping of 
video frames. This approach means that the monitored data set is much smaller and 
often aperiodic; however, resources are only consumed by the monitoring process 
in the event of QoS degradation or failure (some researchers may argue that this 
would cause deadlocks, since a system should not consume more resources at a 
point of QoS failure). 

Many traditional QoS-based adaptive systems use indications of QoS failure 
to initiate adaptation actions. As a consequence any resource management and 
adaptation which is carried out by the system is, more often than not, readily 
noticeable to the end-user. To avoid such disjunction, it is suggested that we should 
attempt to predict the need for adaptation. Some sources of monitoring, such as the 
CPU usage of a video decoder, are often unpredictable¹. You can see from the 
results in figure 4 that the utilisation of resources is often application and 
implementation dependent. The resource utilisation from the playback of a local 
MPEG video is relatively periodic; however, the playback of an Indeo video, which 
can be considered a similar application, uses CPU resources much more 
sporadically. Furthermore, even if resource utilisation patterns can be identified, 
they are often too fine grained to be useful. Nevertheless, prediction techniques are 
useful for coarser grained trends in resource utilisation and/or QoS characteristics. 
In such cases we can introduce simple statistical prediction algorithms, such as 
extrapolation and regression, to make an estimation of the probability of a failure 
occurring. Prediction techniques do depend upon the resource being monitored and 
the process² variants which are causing the system to degrade. Variants are either 
system-driven, such as caching techniques, or user-driven, such as a new selection 
of service. User-driven variants are often stochastic processes (here meaning that they 
are random with a mean of zero), and are therefore very difficult to predict in 
the short term. However, system-driven variants often result in a relatively 
predictable pattern, allowing statistical prediction algorithms to be used to 
extrapolate future resource variations and hence enable adaptations to be employed 
before the point of failure. 

¹ The figures were taken from PerfMon on Windows NT 4.0 whilst decoding videos of 
comparable content through DirectShow 2.0 software decoders. 




Figure 4 Example CPU Utilisations (horizontal axis: t, 0 to 100) 
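
The use of regression and extrapolation for coarse-grained trends, as discussed in section 3.1, can be sketched as follows. This is an illustrative least-squares fit only; the function names and the idea of predicting a threshold-crossing time are assumptions, and the sketch does not estimate a failure probability.

```python
def fit_line(samples):
    """Least-squares line through (t, value) samples; returns (slope, intercept)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        # All samples share one timestamp: no trend can be fitted.
        return 0.0, mean_v
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    return slope, mean_v - slope * mean_t


def time_to_threshold(samples, threshold):
    """Extrapolate the fitted trend to the time at which it reaches `threshold`.

    Returns None when the trend is flat or falling, i.e. no crossing is predicted.
    The result may lie in the past if the trend is already above the threshold.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope
```

For example, CPU samples of 50, 55, 60 and 65 per cent at t = 0, 1, 2, 3 extrapolate to a 90 per cent threshold at t = 8, leaving time to adapt before the point of failure.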

3.2 QoS Requirements and Benefit Functions 

The objective of any general purpose operating system is to share resources across 
multiple applications, in a fair and efficient manner. In doing so, the ultimate goal 
of the system is to fulfil the end-user’s requirements, whether that be displaying 
video without interruption or maintaining a data backup in the event of a system 
crash. End-user requirements can be defined as subjective or objective. An 
example of a subjective end-user requirement is the impairment of transmitted 
video, often used in subjective testing in the engineering community. It is possible 
to use such perceptual QoS metrics to help optimise multimedia communication 
services (Verscheure, 1996). 

² In this context the term process is used to denote the generation of monitoring information. 

Alternatively requirements can be classified as objective. These requirements 
are directly associated with a particular QoS metric and hence are more easily 
described and specified; examples include frame rate and network bandwidth. For 
this reason, traditional QoS-based resource management systems tend to 
concentrate on the management of objective requirements. However, there does 
exist a direct relationship between subjective and objective QoS. Furthermore, this 
relationship is particularly important in defining, and prioritising, which objective 
QoS requirements contribute to the overall goal of the system, satisfying the end- 
user. The relationship between the two can be quantified as a functional 
expression, known as a benefit function, which is a technique originally proposed 
by (Davis, 1994). The generalised abstraction allows the specification of arbitrary 
objectives (or subjective QoS requirements), and their relation to resource 
utilisation and/or objective QoS requirements. In turn, the function can be used by 
a resource manager to determine which adaptation processes are most beneficial to 
the end-user. An example of a use of a benefit function is in describing the 
relationship between frame jitter and frame size and overall subjective quality of 
service. From the video benefit function shown in figure 5 (Davis, 1994), it is 
apparent that once finite thresholds of frame jitter and frame rate are reached (0.2 
and 0.7 respectively) the benefit of using more resources to increase the resulting 
subjective QoS is negligible. 
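
A benefit function with this saturating behaviour can be sketched as follows. The thresholds 0.2 and 0.7 are taken from the text; everything else is an assumption made for illustration: the inputs are treated as normalised levels where higher is better, each contribution is piecewise linear, and the two contributions are combined as a product. The actual function of (Davis, 1994) may differ.

```python
def saturating(level, threshold):
    """Contribution that grows linearly up to `threshold`, then stays at 1.0."""
    return min(max(level, 0.0), threshold) / threshold


def video_benefit(jitter_level, rate_level,
                  jitter_threshold=0.2, rate_threshold=0.7):
    """Hypothetical subjective benefit in [0, 1] for two objective QoS levels."""
    return (saturating(jitter_level, jitter_threshold)
            * saturating(rate_level, rate_threshold))


def marginal_benefit(benefit, current, proposed):
    """Benefit gained by moving from `current` to `proposed` operating points."""
    return benefit(*proposed) - benefit(*current)
```

A resource manager could compare marginal_benefit across candidate adaptations: once both thresholds are reached the marginal benefit is zero, so further resources are better spent elsewhere.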




3.3 Adaptation Mechanisms 

To structure our model, we now define the mechanisms required to support QoS 
adaptation in a distributed environment. Many resource fluctuations in a distributed 
system can be handled implicitly by the processing entities themselves. For 
example, a video decoder which receives a burst of frames and cannot process these 
before the next frames are expected may decide to drop a portion of the frames 
from the burst. In such a case, because the adaptation is very fine grained, the 
likelihood of the adaptation process being noticeable to the end user is small. 
However, more sizeable fluctuations may become apparent through resource 
monitoring and prediction, as previously discussed. Once a system 
monitoring process has indicated that QoS degradation is occurring, or that QoS failure 
is imminent, the system should ideally adapt its resource utilisation in an attempt to 
maintain the end-user’s level of service. As discussed in section 2, the process of 
adaptation, particularly in a distributed environment, involves addressing the 
balance of resources between the clients, servers and the network connections; the 
role of adaptation processes is to initiate such a balancing. 



Table 2 Adaptation Mechanisms 

Mechanism         Technique                      Resource Usage Shift
----------------  -----------------------------  -----------------------------
Resource Control  Static Rate Shaping            client/network -> null
                  Dynamic Rate Shaping           client/network -> server
                  Cache Optimisation             server -> server
                  Priority Adjustment            client_a -> client_b
                  Scaling                        network -> server
                  Network Service Control        client/server network
                                                 network -> client/server
End-to-end        Dual Codec Insertion           network -> client/server
Reconfiguration   Client Side Coder Insertion    network/server client
                  Server Hand-off                server_a -> server_b
                  Network ‘swap in’              network_a -> network_b
                  Service provider fault action  network_a -> network_a
Explicit Change   Audio channel drop             client/network/server -> null
of Service        Video Adjustment               client/network/server -> null

We identify three classes of adaptation mechanisms: resource control, making fine 
grained adjustments to individual resources in the distributed system; 
reconfiguration, altering the topology of the end-to-end processing; and change of 
service, allowing the user to prioritise services and adjust as necessary. The 
majority of adaptation mechanisms fit into one of these classes. Each mechanism 
has a certain granularity, and each may affect the resource distribution in a slightly 
different manner. Table 2 offers some examples of adaptation mechanisms and 
indicates the resulting shift of distribution in resource utilisation. The actual choice 
of adaptation mechanism, in response to received monitoring indications, is dictated 
by a set of system-wide adaptation policies. An adaptation policy describes, in 
conjunction with the application’s resource capacity regions, what adaptation 
processes (control, reconfiguration or change of service) should be executed in 
response to various QoS scenarios. Furthermore, policies can be used in 
conjunction with the previously discussed benefit functions, allowing the 
prioritisation of adaptation mechanisms. Policies are applicable from end-to-end 
and are associated with any of the core distributed resources and their processing 
configuration. We suggest that the specification of policies should employ open 
interfacing techniques, aiding extensibility and readily understood programming 
level abstractions. 
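
The role of an adaptation policy can be illustrated with a simple lookup keyed by resource and capacity region, optionally ranked by a benefit function. This is a sketch under stated assumptions: the policy entries, the region names 'degraded' and 'failed', and the function select_mechanism are hypothetical, and only the mechanism names are drawn from Table 2.

```python
# Hypothetical system-wide policy table:
# (resource, capacity region) -> candidate mechanisms in default order.
POLICIES = {
    ("network.bandwidth", "degraded"): ["dynamic_rate_shaping",
                                        "dual_codec_insertion"],
    ("network.bandwidth", "failed"): ["network_swap_in", "audio_channel_drop"],
    ("end_system.cpu", "degraded"): ["priority_adjustment", "scaling"],
}


def select_mechanism(resource, region, benefit_of=None):
    """Return the mechanism to execute, or None if no policy applies.

    `benefit_of` is an optional callable mapping a mechanism name to its
    estimated benefit; without it the first listed candidate is chosen.
    """
    candidates = POLICIES.get((resource, region), [])
    if not candidates:
        return None
    if benefit_of is None:
        return candidates[0]
    return max(candidates, key=benefit_of)
```

Keeping the policies as data rather than code is one way to obtain the extensibility that open interfacing is intended to provide.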

4 IMPLEMENTATION PERSPECTIVES 

This section is concerned with the incorporation of the discussed QoS adaptation 
techniques into the Distributed Resource Management Architecture (DRMA) 
currently being developed by Lancaster University and BT Labs (Waddington, 
1997). The DRMA platform focuses on the deployment of distributed multimedia 
applications over multi-service ATM networks and offers a framework for the 
hierarchical management of end-to-end resources. In addition, the implementation 
exploits open interfacing and distributed object techniques to aid scalability, 
flexibility and extensibility. 

4.1 Adaptive Bindings 

Within the DRMA, adaptation, monitoring and control processes are all engineered 
as a set of hierarchically structured binding objects (see figure 3). A binding object 
is a particular class of component which is used to abstract the functionality of a 
‘binding’ between one or more source and sink components; a binding represents 
any form of communication (in the network or in the end-system) between these 
components. The combination of binding objects and other processing components 
is used to form the distributed application. Because the binding objects are 
hierarchically distributed, the adaptation and monitoring mechanisms also become 
distributed, thus avoiding the problem of centralised control and management. 
Each binding object maintains a set of interaction interfaces which describe the 
level of service offered through the binding communications, and furthermore 
define the point-to-point QoS constraints. The role of the binding object is to 
maintain the desired QoS, as specified by its interfaces, and in doing so it may, if 
required, employ monitoring and adaptation mechanisms. More detail on 
distributed binding objects is given in (Waddington, 1997). 

Figure 6 Explicit Hierarchical Adaptation (legend: monitoring notification; adaptation control invocation) 



At the lowest level of the binding hierarchy, atomic processing components are 
linked together to form an end-to-end processing chain. Because adaptation and 
control at this level are very fine grained, the adaptation processes are both implicit 
(autonomous control adjustments) and explicit (control adjustments from parent 
bindings). As a consequence, care must be taken to avoid implementing adaptation 
policies which conflict or which may result in continuous counteractions. Explicit 
adaptation actions are usually triggered as a result of either degradation 
notifications or higher level changes in service requirements (see figure 6). In the 
former scenario, indications of QoS failure or degradation are passed from the child 
to the parent bindings. Adaptation policies are then parsed to determine what 
actions should be taken, and then any necessary adaptation processes are executed 
by the local binding object. If the binding object is unable to correct the 
degradation through adaptation (by a lack of suitable adaptation policies), or 
attempts at adaptation have failed, then the QoS notification is forwarded to the 
parent binding object (all notifications are passed through clearly defined QoS- 
feedback object interfaces). This process is continued until either successful 
adaptation actions are executed or, in the event of uncorrectable failure, the end- 
user is notified. 
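
The escalation of notifications from child to parent bindings can be sketched as follows. This is a minimal illustration, not the DRMA binding interface: the class Binding, the notify method and the representation of policies as callables that report success are all assumptions.

```python
class Binding:
    """Hypothetical binding object holding local adaptation policies."""

    def __init__(self, name, parent=None, policies=None):
        self.name = name
        self.parent = parent
        # violation name -> callable that returns True if adaptation succeeded
        self.policies = policies or {}

    def notify(self, violation):
        """Handle a QoS notification.

        Returns the name of the binding that corrected the violation, or
        'end-user' if it reached the top of the hierarchy uncorrected.
        """
        action = self.policies.get(violation)
        if action is not None and action():
            return self.name
        if self.parent is not None:
            # No suitable policy, or adaptation failed: forward to the parent.
            return self.parent.notify(violation)
        return "end-user"
```

Because each binding consults only its own policies before forwarding, no single component needs a global view of the hierarchy.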

Example Adaptive Binding 

As previously discussed in section 2.1, coarse grained adaptations occur within the 
higher level binding objects, such as the service binding. We now describe an 
example adaptive binding which is used to provision a multimedia streaming 
service across a multi-domain ATM network. The binding hierarchy, as illustrated in 
figure 7, is composed of a service binding which delegates the end-to-end 
communications to concatenated end-system and network bindings. 







During the initial setup, the service binding is responsible for mapping 
application level QoS requirements onto end-system and network level 
requirements (a process often referred to as QoS mapping). It is likely that many 
such mappings exist for a particular end-to-end service. Nevertheless, in 
determining the choice of end-to-end resource, the service binding object must 
consider current resource availability (determined through 
monitoring) and cost, a metric which is particularly relevant to network 
resources. Once a mapping has been determined the binding objects are 
instantiated and their QoS requirements exchanged. The next phase of the setup 
involves each individual binding object carrying out further QoS mapping to 
determine what resources are required to maintain its previously agreed point-to- 
point requirements. If there are insufficient resources, then the binding object must 
indicate an admission failure to the parent service binding. In our prototype 
implementation, the end-system binding uses statically defined mapping 
information to create a chain of end-system processing components (such as 
decoders and renderers) which meet the QoS requirements previously requested. In 
admitting the components, the end-system binding must also ensure that there are 
approximately (because we are using loose reservation) sufficient available CPU, 
memory, and other end-system resources. The setup of the network binding objects 
is slightly different. The network binding must employ some form of connection 
setup and routing, to establish a point-to-point QoS constrained network service. In 
our prototype implementation, we have used the ATM Forum’s UNI/NNI based 
signalling to create a guaranteed service to a Winsock2/AAL5 protocol stack. 
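
The choice between several candidate QoS mappings, taking account of current availability and cost, can be sketched as follows. The data layout (dictionaries with 'requires' and 'cost' entries), the resource names and the function choose_mapping are hypothetical; the prototype described above uses statically defined mapping information whose format is not given here.

```python
def choose_mapping(mappings, available):
    """Pick the cheapest candidate mapping that fits the available resources.

    `mappings` is a list of dicts with 'name', 'requires' (resource -> amount)
    and 'cost'; `available` maps resource -> amount currently free, as reported
    by monitoring. Returns None to signal an admission failure.
    """
    feasible = [
        m for m in mappings
        if all(available.get(r, 0) >= need for r, need in m["requires"].items())
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m["cost"])
```

Under loose reservation the availability figures are themselves approximate, so a mapping chosen here may still need to be corrected later through adaptation.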




During the lifetime of the service binding, there are potentially changes in the state 
of end-to-end resources which may in turn require QoS adaptation. For instance, 
one likely scenario is that the level of network service that was initially reserved is 
not sufficient for the application’s streaming requirements (maybe video frames are 
being lost). Such mismatches in reserved or available resources, and the actual 
required resources, are indicated through QoS monitoring fed back from the 
network objects (see section 4.1). On receipt of QoS degradation notifications, it 
becomes the responsibility of the service binding to initiate QoS adaptation. The 
choice of adaptation is made through a finite state machine representing the 
originally discussed resource capacity region. This is used to determine which 
adaptation policies are valid for the given level of service. If the binding is unable 
to make any suitable adaptation, then the component must indicate QoS failure to 
the application. Within this context likely adaptation scenarios would be 
adjustments in the network QoS (maybe through the setting up of a new ATM 
connection and carrying out a hot swap) or alternatively the incorporation of 
compression components in the end-systems. In some cases adaptation actions may 
take the form of reconfiguration. For example, consider the scenario where the 
initially chosen network service provider can only support a limited 25 Mbps 
connection and a requirement of 26 Mbps becomes evident. In this case the service 
binding object may choose to use an alternative network service provider and 
release the previously allocated network resources. Finally, in parallel with the 
previously discussed coarse-grained adaptations, the system is likely to experience 
various fluctuations in low level QoS, which are counteracted through intrinsic 
fine grained adaptation actions. 
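
The use of a finite state machine to select adaptations within the service binding can be sketched as follows. The states, events and action names are assumptions chosen to mirror the scenarios above (network QoS adjustment, then a change of provider, then failure); they are not the states of the prototype.

```python
# Hypothetical transition table: (state, event) -> (next state, action or None).
TRANSITIONS = {
    ("in_contract", "degradation"): ("adapting", "adjust_network_qos"),
    ("adapting", "recovered"): ("in_contract", None),
    ("adapting", "degradation"): ("reconfiguring", "switch_provider"),
    ("reconfiguring", "recovered"): ("in_contract", "release_old_resources"),
    ("reconfiguring", "degradation"): ("failed", "notify_application"),
}


def step(state, event):
    """Apply one event; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))


def run(events, state="in_contract"):
    """Feed a sequence of monitoring events; return final state and actions taken."""
    actions = []
    for event in events:
        state, action = step(state, event)
        if action is not None:
            actions.append(action)
    return state, actions
```

In the 25 Mbps example, a first degradation would trigger an attempt to adjust the existing network service; a second would lead to the alternative provider, after which the previously allocated resources are released.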

5 CONCLUSION 

We have proposed a general model for the support of QoS-based adaptation and 
resource management in distributed multimedia systems. Our perspective is across 
the complete end-to-end co-ordination and control of network and end-system 
resources, supporting end-to-end QoS constrained processing and communications. 
The general model of QoS adaptation incorporates the following principles: 

• QoS provisioning through distributed resource management; 

• Hierarchical approach to QoS management, leading to scalability; 

• Combination of ‘loose’ resource scheduling and dynamic adaptation policies; 

• Adaptation through both monitoring and prediction; 

• Indications of end-user requirements through functional modelling of benefits. 

We have also made progress towards the implementation of our proposed QoS 
adaptation mechanisms, using QoS adaptation techniques in the development of the 
prototype Distributed Resource Management Architecture (DRMA); DRMA 
provides a distributed platform for the development and deployment of QoS- 
constrained continuous media applications. The system, based on Windows NT, 
uses distributed object programming methods and hierarchical QoS management to 
support adaptive bindings which transparently encapsulate the proposed resource 
monitoring and adaptation mechanisms. Our future work is directed towards a 
fuller implementation of the Distributed Resource Management Architecture 
together with further support for QoS management and QoS adaptation, in both the 
network and end-system bindings. 

Abstract 

In this paper we discuss the need for resource reservation in the Internet and 
examine some of the strengths and weaknesses of RSVP, which is currently the 
most popular of the Internet reservation protocols that have been developed. The 
deficiencies of RSVP motivate our design of a new resource reservation protocol 
which uses dynamic sender-initiated reservations to achieve a highly bandwidth- 
efficient reservation mechanism with excellent scalability with regard to round 
trip time, data rate and number of hosts. 

1 INTRODUCTION 

It is clear that the current Internet which was founded upon the concept of ‘best- 
effort’ datagram delivery must be enhanced in some way in order to accommodate 
the changing communications environment. In particular there is a growing 
demand for real-time applications which have specific Quality of Service (QoS) 
requirements, especially with regard to end-to-end delay and minimum bandwidth, 
both of which cannot be guaranteed in the current Internet using traditional 
connectionless best-effort delivery. Furthermore as the World-Wide-Web is 
increasingly used for business there is a growing number of users for whom delay 
bounded access of information is important. 

In response to the changing requirements of Internet users, much attention has 
focussed on the use of resource reservation as a means of providing selected data 
flows with special QoS commitments in accordance with their needs. Under such a 
framework it is likely that special QoS delivery would be the exception rather than 
the rule with the majority of Internet traffic continuing to receive the ‘default’ best- 
effort mode of delivery. The special QoS required by a specific data flow can be 
realised by reserving resources (bandwidth, buffer space) and installing appropriate 
scheduling behaviour in each router along the end-to-end path followed by the data 
flow. Such mechanisms require admission control at the individual intermediate 
nodes to ensure that the request for reservation is only accepted and installed 
provided sufficient resources are available. In addition, per-flow state¹ in the 
intermediate nodes will usually be required in order to identify the flows to receive 
special QoS as well as the QoS to be received. 

In order to allow users to invoke special QoS delivery on demand for a data 
flow several protocols have been developed to enable users to communicate their 
QoS needs to the intermediate routers along the data path in an IP internetwork. 
The majority of these protocols initiate the set up of flow-specific reservation state 
in intermediate routers, a notable exception being the approach described in 
(Almesberger, 1997) whereby no per-flow reservation state is set up in routers 
which instead record their reservation commitments as a whole per output port. 
While this approach potentially offers very good scalability characteristics for a 
large number of flows, it is dependent upon a certain degree of trust among end 
hosts not to exceed their indicated traffic levels unless per-flow policing is applied 
at the network access point. In addition, the approach is only able to offer end 
applications an approximate minimum bandwidth without any quantitative 
guarantees on loss or delay and so may not be suitable for applications with 
stringent QoS requirements such as Distributed Interactive Simulation 
(Seidensticker, 1997). 

¹ The introduction of per-flow state is a significant departure from the initial Internet design 
philosophy of a pure connectionless network with no per-flow state in the intermediate 
routers. 

Of the reservation protocols that set up flow-specific reservation state, an early 
example in the Internet is the Stream Protocol, ST (Forgie, 1979), which was 
limited to unicast reservations. Although its successors, ST-II (Topolcic, 1990) 
and the more recent ST2+ (Delgrossi, 1995) can handle both multicast and unicast 
reservations as well as possessing many improvements over ST, the ST group of 
protocols has attracted little commercial interest. By contrast, another reservation 
protocol, RSVP (Braden, 1996) has received significant industry support and with 
good reason. Unlike the ST protocols, RSVP reservation state is soft-state and will 
time-out in the absence of any refresh reservation requests within a certain time 
period. This so-called soft-state nature of RSVP provides a very simple failure 
recovery mechanism over a wide range of fault scenarios and helps to retain much 
of the robustness that has helped to make IP so successful. The soft-state approach 
where the end applications are responsible for maintaining the flow-specific router 
state leads to a significant reduction in complexity compared to a hard-state 
approach where the network is responsible for maintaining the flow-specific 
router-state. RSVP has other notable architectural differences compared to the ST 
protocols such as receiver-initiated rather than sender-initiated reservations². The 
initial design of RSVP was to a large extent influenced by the needs of multicast 
conferencing applications although its intended use is now much broader. 
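
The soft-state principle described above (state that times out unless refreshed) can be illustrated with a small table of expiry times. This is a generic sketch, not RSVP's message processing or any real RSVP API: the class SoftStateTable, its methods and the use of a single fixed lifetime are assumptions.

```python
class SoftStateTable:
    """Hypothetical per-router table of flow state with refresh and time-out."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.expiry = {}  # flow id -> time at which the state times out

    def refresh(self, flow_id, now):
        """Install or refresh reservation state for a flow."""
        self.expiry[flow_id] = now + self.lifetime

    def expire(self, now):
        """Remove and return all flows whose state has not been refreshed in time."""
        stale = [f for f, t in self.expiry.items() if t <= now]
        for f in stale:
            del self.expiry[f]
        return stale

    def active(self):
        """Return the set of flows that currently hold state."""
        return set(self.expiry)
```

If an end host fails or a route changes, refreshes simply stop arriving and the state disappears without any explicit teardown, which is the source of the simple failure recovery noted above.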

While RSVP is concerned merely with signalling the end application’s 
reservation requests to the intermediate nodes, it is the special QoS delivery 
models³ that define the node behaviour required to meet the signalled special QoS 
objectives. The Integrated Services Working Group (intserv) of the IETF (intserv, 
1998) has standardised several special QoS delivery models while the Integrated 
Services over Specific Lower Layers (issl) Working Group of the IETF (issl, 1998) 
has developed ways of mapping this network layer QoS onto specific link layer 
technologies such as ATM, IEEE 802 and Ethernet. 

In parallel with the recent Internet growth much interest has been generated by
Asynchronous Transfer Mode (ATM), a technology designed from the outset with
end-to-end QoS in mind. A necessary component in ATM networks for achieving
QoS on demand is a signalling protocol in order to request resource reservations in
the intermediate nodes of the end-to-end path. More traditional ATM signalling
protocols such as ITU’s Q.2931 standard for public networks (ITU-T, 1995) or
ATM Forum’s UNI standards for private networks (ATM Forum, 1996) use
end-to-end handshaking to set up an end-to-end reservation before data transfer can
take place. A more dynamic and flexible approach is that provided by the ATM
Block Transfer/Immediate Transfer (ABT/IT) (ITU-T, 1996) signalling protocol
which sends reservations in-line with data and as such is more conducive to 
efficient bandwidth utilisation than the more static end-to-end handshaking 
approach. 



2 ST2+ permits both sender- and receiver-initiated reservations; ST-II and ST permit
sender-initiated reservations only.

3 At present two delivery models have been standardised, both of which offer applications 
an end-to-end minimum bandwidth albeit with different assurances. First, Guaranteed 
Service which offers applications a loss-free service with an end-to-end delay bound. 
Second, Controlled-Load Service which does not provide any quantitative guarantees on
delay or loss, although qualitatively these parameters can be expected to be the same as for 
best-effort delivery under low network load. 
In the next few sections we present a new QoS signalling protocol known as
Dynamic Reservation Protocol (DRP) which could be used to set up the IETF’s
integrated services models ‘on-the-fly’ in IP internetworks. We outline the benefits 
of DRP compared to RSVP before presenting details of packet formats and 
processing rules and our conclusions. 

2 DYNAMIC RESERVATION PROTOCOL (DRP) OVERVIEW 

Our protocol, known as Dynamic Reservation Protocol (DRP), incorporates many
principles of RSVP along with the dynamic sender-initiated reservation concept of 
ABT/IT to achieve the following goals: 

• High control dynamics to achieve efficient bandwidth usage for both sender- 
specific and shared reservations. 

• Scalability of router-state with regard to number of senders and receivers. 

• Scalable and simple approach to One Pass With Advertising (OPWA) 4 . 

• Minimal receiver complexity. 

• Minimal number of messages to implement session-wide reservation changes 
in large-scale multicast sessions. 

• Heterogeneity of reservation QoS classes among receivers of the same session. 

DRP allows reservations to be set up ‘on-the-fly’ by sending Reservation packets, 
RES in-line with the data flow. In this respect, DRP is similar to ABT/IT although, 
unlike ABT/IT, DRP does not need to make an end-to-end connection before 
sending its first in-line reservation packet. Also, unlike ABT/IT, DRP does not 
support the concept of a sustainable cell rate for the data transfer and consequently 
the probability of acceptance of a reservation request is determined purely by the 
available resources at that moment in time. As with RSVP, all flow-specific router 
state that is set up using DRP is soft-state as we believe that this is a key strength 
of RSVP that can also be used to good effect in DRP. 

The scheme also uses Return (RTN) packets that are reverse-routed up the tree 
to provide the intermediate routers/switches and sender with certain feedback and 
end-to-end path information. DRP is applicable to both unicast and multicast 
scenarios but in the following sections we concentrate on the more complicated 
multicast case. 

3 DRP DESIGN PRINCIPLES 
3.1 Sender initiated reservations 

The use of in-line reservation packets allows the sender to set up new reservations, 
or alter existing reservations, on demand at any point in the data transfer. This 



4 This is a term introduced by RSVP, to describe a mode whereby all necessary information 
is made available(advertised) in advance of making a reservation request so that the correct 
level of reservations necessary to achieve the target end-to-end QoS can be determined and 
installed in ‘one pass’ of the reservation message. 
makes it possible to achieve a very close match between the instantaneous service
provided by the network and the instantaneous requirements of the data flow. As a 
result, network resource usage can be minimised. These benefits are particularly 
prominent for stop/start data flows since the resources can be freed during the quiet 
periods and re-installed on a just-in-time basis at the start of each activity burst. 
Such action is precluded with both RSVP and traditional ATM signalling 5 
approaches, both of which incur a time lag in excess of the round-trip time when 
modifying end-to-end QoS to reflect a change in the sender’s traffic stream 
characteristics. 

Another advantage of using sender-based reservations rather than receiver- 
based reservations is a reduction in the volume of processing required at 
intermediate nodes of a multicast tree each time a sender changes its traffic stream 
characteristics. With RSVP, each time this occurs and the receivers consequently 
modify their reservations, it is possible for a node to install a reservation due to a 
request from a particular receiver, only for it to increase the reservation a moment 
later when a larger request from a different receiver arrives at the same interface. 
In fact when the multicast tree serves a large number of receivers it is possible that 
some reservations may be updated several times before settling down to their 
steady state values. This effect will be particularly prominent for a Guaranteed 
Service session since each receiver will probably need to request a different 
reservation bandwidth even if they require the same end-to-end delay bound. By 
contrast with DRP, a single pass of the RES packet down the multicast tree will 
typically 6 achieve the new steady state reservations in the on-tree nodes. 

3.2 Heterogeneity of QoS reservation classes between receivers of 
the same session 

DRP allows the sender to request, ‘on-the-fly’, intserv’s Controlled-Load Service
(Wroclawski, 1997) and Guaranteed Service (Schenker, 1997) by sending a RES
packet in-line with data. The sender designates a ‘ceiling’ reservation class (or
type), CRTs, to each data flow block 7 as well as an associated end-to-end QoS
level. In addition each receiver specifies a ‘ceiling’ reservation class, CRTr, which
represents the highest quality reservation class it is willing to receive. The
Guaranteed Service reservation class is taken to be the highest quality reservation
class, followed by Controlled-Load Service with ‘best-effort’ (no reservation) being
the lowest. Assuming that sufficient end-to-end resources exist, the effective
end-to-end reservation class received by a receiver will then be given by
MIN(CRTs, CRTr). Each receiver is free to change its value of CRTr at any time by sending a



5 Traditional ATM signalling (e.g. Q.2931 and UNI) requires end-to-end handshaking.

6 In the case of Guaranteed Service, DRP may sometimes use feedback to alter certain 
reservations after the first pass in an attempt to achieve a target end to end delay bound that 
was not satisfied on the first pass as described in section 3.6. 

7 A data flow defined by the combination of (sender IP address, sender port, destination IP 
address, destination port, transport layer protocol) can be considered as a series of data flow 
blocks, each of which may have its own specific QoS requirements. 
RTN packet upstream containing the new value of CRTr. In addition, the
reservation class installed at each on-tree outgoing interface will be the lowest 
quality reservation class that is necessary to guarantee each receiver their effective 
end-to-end reservation class as determined by the above rules. This is exemplified 
in Figure 1. 
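
The class-selection rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the numeric ranking of the classes and the function names `effective_class` and `interface_class` are assumptions made for the example and are not part of the DRP specification.

```python
# Rank reservation classes by quality; the numeric values are illustrative only.
BE, CL, GS = 0, 1, 2  # best-effort < Controlled-Load < Guaranteed Service


def effective_class(crt_s, crt_r):
    """Effective end-to-end class seen by one receiver: MIN(CRTs, CRTr)."""
    return min(crt_s, crt_r)


def interface_class(crt_s, downstream_crt_r):
    """Class to install on one on-tree outgoing interface.

    It is the lowest class that still satisfies every receiver reached
    through that interface, i.e. the highest of their effective classes.
    """
    return max((effective_class(crt_s, c) for c in downstream_crt_r), default=BE)


# Sender offers GS; the receivers behind one interface have ceilings CL and GS.
print(interface_class(GS, [CL, GS]))  # 2, i.e. GS
# Sender offers GS; no receiver behind this interface wants more than CL.
print(interface_class(GS, [CL, BE]))  # 1, i.e. CL
```

Taking the maximum over the downstream receivers reflects the requirement that an interface never installs a higher class than any receiver behind it can actually use.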
Figure 1: Heterogeneity of reservation classes between receivers of same session.

Table 1 compares the DRP approach to providing end-to-end delay bounds with 
those of RSVP and ABT/IT. DRP is similar to ABT/IT in the sense that for a given 
sender to a multicast session the target end-to-end delay bounds will be identical 
for each receiver 8 . The difference is that with DRP the sender can control what this 
delay bound will be whereas with ABT/IT the delay bound is a feature of the QoS 
class provided by the network and cannot be controlled by end nodes. Like DRP, 
RSVP facilitates end node control of end-to-end delay albeit by receivers rather
than senders. RSVP allows receivers fine-grained control within a reservation
class at the expense of added receiver complexity together with lack of support for 
reservation class heterogeneity among receivers of a session. By contrast, DRP 
supports such reservation class heterogeneity in that the sender suggests a 
reservation class and QoS level for all receivers who then have the option of 
downgrading QoS class. Although DRP does not offer receivers any control over 
QoS level within a class, we do not believe such a feature is necessary anyway and 
certainly not with regard to end-to-end delay bound. We argue that any such end- 
to-end delay bound is determined by the nature of the sender’s traffic stream and as 



8 In the case of DRP we are only referring to those receivers that have actually requested an
end-to-end delay bound, i.e. those with CRTr=GS.
such the sending application is the node most qualified to specify what it should
be. The receivers simply need to be told what this end-to-end delay bound is so that 
they can set their playout buffers accordingly. Furthermore, removing receiver- 
control of end-to-end delay in DRP enables merging of RTN messages and RTN 
state in routers to ensure scalability to large multicast sessions as described in 
section 3.4. 





                                   RSVP    ABT/IT    DRP
Sender control of delay bound      No      No        Yes
Receiver control of delay bound    Yes     No        No



Table 1: Comparison of different schemes with regard to provision of end-to-end 
delay bounds 

3.3 Reservation request admission control. 

Apart from the initiator of QoS requests(sender vs receiver) there are two other 
notable differences between the reservation mechanisms used by RSVP and DRP. 

1. Explicit vs implicit reservation requests. 

With RSVP, for both Controlled-Load and Guaranteed Service reservations, the 
request explicitly informs the node of the level of resources to reserve. The same 
can also be said of a Controlled-Load Service reservation request using DRP. 
However with a DRP Guaranteed Service request the RES packet requests the level 
of resources to reserve implicitly by informing the router of the accumulated delay 
bound thus far, together with the target delay bound and the sender traffic 
characteristics. Using this information along with path information obtained from 
RTN packets each router is able to estimate the local reservation required and 
update the accumulated delay bound in the RES packet accordingly. Each router 
calculates a local reservation bandwidth which, if also reserved in each subsequent 
router, will lead to an overall delay bound equal to the target delay bound. 
However, should any router have insufficient resources to install the calculated 
local reservation bandwidth then it reserves the most that it can and the attempt is 
only referred to as a ‘reservation failure’ if the resultant accumulated delay thus far 
exceeds the target delay bound. If the attempt is not a so-called ‘reservation 
failure’ then the RES message is treated the same regardless of whether the level of 
local reservation initially calculated could be reserved or it couldn’t. This is 
because even in the latter case the target end-to-end delay bound may still be met 
since each subsequent router will automatically attempt to reserve more in order to 
compensate. The action taken in the event of a so-called ‘reservation-failure’ is 
discussed next. 

2. Action in event of reservation failure. 

With RSVP, any request that fails admission control at a router is not propagated 
any further along its path towards the sender(s) and a ResvErr message is sent to 
the affected receiver(s). By contrast, whenever a DRP node cannot satisfy the
calculated local reservation, and the maximum level of resources that it can reserve
is so low that it prevents the target end-to-end QoS from being satisfied, the
request is not rejected. Instead, the node reserves as many resources as possible
and sets specific QoS violation bits in the RES header while updating the other
header fields in the usual manner before propagating the RES message down the
distribution tree.

In the event of a DRP Controlled-Load Service ‘reservation-failure’, the node 
sets a bit, known as the QoSvoid bit, to 1. The RES packet is handled in the usual 
way by all subsequent routers encountered, although the presence of the non-zero 
QoSvoid bit will be an indication to receivers that the end-to-end QoS could not be 
achieved. 

In the event of a Guaranteed Service request where the node could not reserve 
more than the mean rate of the sender’s traffic, it becomes impossible to guarantee 
either lossless transmission or conformance to the target delay bound, or even the 
Controlled-Load Service. Consequently three flags should be set in the RES
packet, namely the delayvoid, lossvoid and QoSvoid bits. When downstream
routers see a RES packet with (CRTs=GS, delayvoid=lossvoid=1) then they take
the ‘effective CRTs’ to be Controlled-Load Service (CL) and attempt to install a
Controlled-Load Service reservation.

In the event of a Guaranteed Service request where lossless transmission has 
not yet been precluded 9 but the accumulated bound at a node exceeds the target 
delay bound for the first time, the action taken is as described in section 3.6. 
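
The flag handling described above can be sketched as follows. The names `ResHeader`, `mark_failure` and `effective_crts` are hypothetical and chosen only for this illustration; the class values follow the 2-bit CRTs encoding given for the RES message (11, 10, 00). The case where the delay bound is exceeded for the first time is handled by the feedback mechanism and is left out of this sketch.

```python
from dataclasses import dataclass

# 2-bit CRTs encodings as listed for the RES message.
GS, CL, BE = 0b11, 0b10, 0b00


@dataclass
class ResHeader:
    """Only the RES header fields needed for this illustration."""
    crts: int
    qosvoid: bool = False
    delayvoid: bool = False
    lossvoid: bool = False


def effective_crts(res):
    """Class a downstream router should attempt to install for this packet."""
    if res.crts == GS and res.delayvoid and res.lossvoid:
        return CL  # GS can no longer be met, so fall back to Controlled-Load
    return res.crts


def mark_failure(res, reserved_rate, mean_rate):
    """Record a 'reservation failure' in the header instead of rejecting it.

    Call this only when the node has already found that the target QoS
    cannot be met locally; the packet is still forwarded afterwards.
    """
    if res.crts == CL:
        res.qosvoid = True
    elif res.crts == GS and reserved_rate <= mean_rate:
        res.qosvoid = res.delayvoid = res.lossvoid = True
    return res


res = mark_failure(ResHeader(crts=GS), reserved_rate=100.0, mean_rate=200.0)
print(effective_crts(res) == CL)  # True
```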

In the case of Guaranteed Service reservations in a large multicast tree, there
are some interesting differences between DRP’s sender-based reservations and
RSVP’s receiver-based reservations. To illustrate these differences we refer to the
example topology of Figure 2. This shows the logical connectivity of a multicast
session between a sender, S, and two receivers, R1 and R2. These end nodes are
interconnected via routers r1-r3, and all links are 10 Mbps Ethernet. The exported C
and D error terms (Schenker, 1997) from the routers are shown together with the
token bucket parameters of the Sender Tspec. We will assume that both receivers,
R1 and R2, require a queuing delay bound of 300 ms to sender S. With RSVP, each
receiver calculates an Rspec to be reserved in each router along the end-to-end
path in order to achieve its delay bound. In this example, R1 calculates an Rspec of
325.3 Kbytes/s while R2 calculates an Rspec of 490.89 Kbytes/s. At router r2 these
two requests are merged so that the Rspec propagated to router r1 is
490.89 Kbytes/s. Packets from S to receiver R1 will now experience a reservation
bandwidth of 490.89 Kbytes/s in router r1, interface 2, rather than the requested 325.3
Kbytes/s. This will cause a reduction in R1’s end-to-end delay, meaning that
theoretically the bandwidth reserved for R1 in r2, interface 2, could be decreased
from the initially calculated value of 325.3 Kbytes/s while still achieving R1’s
end-to-end delay bound. However RSVP does not facilitate such a mechanism and in this
example R1’s end-to-end delay bound will be less than, rather than equal to, that
requested. By contrast, DRP keeps a running total of end-to-end delay bound

9 That is, every node so far has been able to reserve in excess of the mean rate of the 
sender’s traffic 
which it updates at each hop and uses to calculate the local reservation required to
stay on course for the desired end-to-end delay bound. As a result, in this example
DRP automatically readjusts the reservation level in r2, interface 2 where 247.7
Kbytes/s is then reserved rather than 325.3 Kbytes/s as in RSVP.
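
For readers who wish to experiment with receiver-style Rspec calculations of the kind described above, the following sketch evaluates the Guaranteed Service queuing delay bound (Schenker, 1997) and inverts it for a target delay. For a reserved rate R with p > R >= r the bound is (b-M)/R * (p-R)/(p-r) + (M+Ctot)/R + Dtot, and for R >= p it is (M+Ctot)/R + Dtot. The function names are hypothetical, delays are assumed to be in seconds and rates in bytes/second, and the parameter values in the usage line are invented for illustration since the values of Figure 2 are not reproduced here. The sketch does not implement DRP's own per-hop calculation.

```python
def gs_delay_bound(R, p, b, r, M, c_tot, d_tot):
    """End-to-end queuing delay bound (s) for reserved rate R (bytes/s)."""
    if R < r:
        raise ValueError("reserved rate must be at least the token rate r")
    if R >= p:
        return (M + c_tot) / R + d_tot
    return (b - M) / R * (p - R) / (p - r) + (M + c_tot) / R + d_tot


def rate_for_bound(target, p, b, r, M, c_tot, d_tot):
    """Smallest rate R >= r whose delay bound does not exceed target."""
    slack = target - d_tot
    if slack <= 0:
        raise ValueError("target is not above the fixed delay term d_tot")
    k = (b - M) / (p - r) if p > r else 0.0
    R = (k * p + M + c_tot) / (slack + k)  # solution when r <= R < p
    if R >= p:
        R = (M + c_tot) / slack            # solution when R >= p
    return max(R, r)


# Hypothetical Tspec and path terms, not those of Figure 2.
params = dict(p=1_000_000, b=50_000, r=200_000, M=1_500, c_tot=3_000, d_tot=0.01)
R = rate_for_bound(0.1, **params)
print(round(R), round(gs_delay_bound(R, **params), 6))
```

Because the bound decreases as R increases, a merged (larger) upstream reservation can only lower the delay actually experienced, which is the effect described for R1 above.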

For multicast examples such as this where the routers and links are
homogeneous (same values of C, D error terms and link propagation delay) RSVP
will never use fewer resources than DRP. However in environments with
heterogeneous routers and links the matter is not as straightforward. With DRP, in
a multicast environment the bandwidth to be reserved at each node is calculated
based on, among other things, worst-case merged (see section 5.3) path
characteristics received from RTN messages. The effect of this worst-case merging 
can be for DRP to make an over-estimation in the local reservation. Any such over- 
estimation will cause a reduction in the local node queuing delay. In turn this will 
mean that DRP allows an increase in the local queuing delay at nodes further 
downstream whose reservations will then not be as high. This ‘skewing’ of the 
bandwidth reservation pattern in multicast sessions whereby nodes closer to the 
sender are more likely to over-estimate their local reservations can theoretically 
cause an increase in the overall reservation bandwidth required in the multicast 
tree. This is an area for further study. 
Figure 2: Guaranteed Service Reservations using RSVP and DRP



3.4 Merging of RTN messages 



DRP uses RTN messages which are reverse-routed up the distribution tree from
receiver(s) to sender(s) for the following purposes:

1. To accumulate certain path characteristics information which is used by a node
when calculating the level of resources to reserve. 

2. To allow a receiver to downgrade its received reservation class below that 
suggested by the sender. 

3. Optional feedback information that may be used to convey information to 
intermediate routers in cases where the end-to-end delay bound was not 
satisfied in the first pass of a RES message. 

With respect to 1., RTN messages fulfil a similar role to Path messages in RSVP. 
Figure 3: DRP RTN messages on shared tree per refresh period in the steady state.

Figure 4: RSVP Path messages on shared tree per refresh period in the steady state.

For large-scale multipoint-to-multipoint applications the use of a single shared tree,
such as a Core Based Tree (CBT), for all senders to a multicast group will consume
far fewer resources (Billhartz, 1997) than a separate source-based tree for each
sender to the group. Because of this, a single shared tree is likely to be preferable
to a mesh of source-based trees in such scenarios. In such cases DRP displays
much more favourable scalability characteristics than RSVP. With DRP, full
merging of RTN messages is possible and ensures that the number of RTN
messages on each link of a shared tree in the steady state is never more than
two (one in each direction) every refresh interval as shown in Figure 3. By contrast,
with RSVP the total number of Path messages on each link of a shared tree per 
refresh period in the steady state is equal to the number of senders as shown in 
Figure 4. However perhaps a more important benefit of the DRP approach is the 
fact that the number of RTN state entries in each on-tree router is equal to the 
number of on-tree logical interfaces and so never becomes an issue no matter how 
many hosts are sending to the group. By contrast, with RSVP the number of Path 
state entries in each router and end-host of a multicast shared-tree is equal to the 
number of senders to the group and consequently may become excessive for large- 
scale multipoint-multipoint applications. 
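
The scaling comparison above can be tabulated with a small helper. This is merely a restatement of the steady-state figures quoted in the text; the function name and dictionary keys are hypothetical.

```python
def shared_tree_overhead(num_senders, on_tree_interfaces):
    """Steady-state control overhead on a shared tree, per refresh period."""
    return {
        # DRP: RTN messages are fully merged, at most one per direction.
        "drp_rtn_msgs_per_link_max": 2,
        # RSVP: one Path message per sender on each link.
        "rsvp_path_msgs_per_link": num_senders,
        # DRP: one RTN state entry per on-tree logical interface.
        "drp_rtn_state_per_router": on_tree_interfaces,
        # RSVP: one Path state entry per sender.
        "rsvp_path_state_per_router": num_senders,
    }


for n in (1, 10, 1000):
    print(n, shared_tree_overhead(num_senders=n, on_tree_interfaces=3))
```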
3.5 Shared-Session Reservations and Intra-Session Reservation Style
Heterogeneity

RSVP supports shared-style reservations which match on multiple senders and are 
usually used on the understanding that only one sender will be active at once. 
Although this will yield resource savings compared to a number of sender-specific 
reservations, shared-style reservations that are set up using RSVP can still be sub- 
optimal for 2 reasons. First, the reservation stays in place during quiet periods. 
Second, during active periods the reservation may sometimes or always be larger 
than necessary to meet the agreed QoS for the data flow currently using it (White, 
1998). Both of these inefficiencies are obviated with DRP if the shared-session 10 is
handled using sender-specific reservations that are installed and torn down
on-the-fly at the start and end of each activity burst. However, in such cases it may be
possible for end users to detect a degradation in QoS at the start of each activity 
burst due to the finite time required to install the ‘just-in-time’ sender-specific 
reservation. One way in which such QoS disruption could be minimised is to use 
what we refer to as a ‘simple shared reservation’ which would apply to all senders 
to the session and would be left in place during quiet periods but modified at the 
start of an activity burst each time the sender to the session changed. With this 
approach QoS disruption would only occur at the start of an activity burst for a 
new sender whose reservation requirement was greater than that of the previous 
sender, but even in this case the QoS disruption would be minimised because of the 
presence of the now free reservation from the previous sender that can be used as a 
starting point for the ‘just-in-time’ reservation request from the new sender to build 
upon. 

While the ‘simple shared reservation’ mechanism just described would work 
well in the true ‘shared-session’ case where there is never more than a single active 
sender at any one time, it would suffer from under or over-reservations 11 in cases 
where it is possible for multiple senders to be simultaneously active which might 
occur in the absence of appropriate conference control mechanisms. This 
deficiency of a simple shared reservation approach is highlighted in Figure 6 for 
the example traffic pattern of Figure 5. Bearing these potential hazards in mind, 
DRP provides an alternative reservation mode to the standard sender-specific(SS) 
mode known as Sender-Specific with Residue(SSR). In SSR mode each sender 
makes a reservation at the start of each activity burst and sends a teardown request 
at the end of the activity burst. When a sender’s teardown request 12 reaches an 
outgoing interface of a router the SSR reservation of the sender will only be 
removed if at least one other SSR reservation for the session is in place in the 



10 where only one sender to the session transmits at once. 

11 That is, over-reservations in addition to the ‘over-reservations’ present when the 
reservation stays in place during quiet periods. 

12 A teardown request is simply a RES packet with the reservation level set to 0. It indicates 
to the intermediate routers that the sender no longer requires a reservation for its data 
packets. 
router at that outgoing interface. Otherwise the sender’s SSR reservation is left in
place, but a status flag associated with the reservation is set from ‘active’ to 
‘passive’ state. 

Figure 6 illustrates the operation of SSR mode for the traffic pattern of Figure 
5. When only one sender is active at once, the operation of SSR mode is essentially
the same as with a ‘simple shared reservation’ and so will suffer from resource
wastage when all of the senders go simultaneously quiet 13 . However should the
senders go simultaneously quiet for extended periods of time the soft-state nature 
of the reservation will cause it to eventually timeout and be removed. In cases 
where more than one sender is simultaneously active, the operation of SSR mode is 
essentially the same as SS mode and so cannot suffer from the under-reservation 
problem that exists with the ‘simple shared reservation’. 

A notable advantage of DRP compared to RSVP is that DRP allows co- 
existence of both modes of reservations within the same multicast session while 
with RSVP each receiver within a given multicast session must choose the same 
reservation style. The way in which co-existence of reservation modes within the 
same multicast session is accommodated in DRP is summarised as follows. 

When a reservation request arrives at an on-tree incoming router interface 
it is copied to each on-tree outgoing interface where the following steps are 
applied: 

If reservation for that sender already exists 

• Set reservation’s mode flag (0=SS, 1=SSR) according to mode field in
RES packet. 

• Adjust reservation level to value indicated in RES packet. 

• Set reservation’s status flag to 1 (active). 

Else if mode field in RES packet indicates SS 

• Create a new reservation according to filter spec and value indicated in 
RES packet. 

• Set reservation’s mode flag to indicate SS. 

• Set reservation’s status flag to 1 (active). 

Else if an SSR reservation exists with state = ‘passive’

• Set filter spec of that reservation to the sender of the RES packet. 

• Adjust reservation level to value indicated in RES packet.

• Set reservation’s status flag to 1 (active). 

Else 

• Create a new reservation according to filter spec and value indicated in 
RES packet. 

• Set reservation’s mode flag to indicate SSR. 

• Set reservation’s status flag to 1 (active). 

13 For example in a multimedia conference if the audio channel used a different multicast 
group to the other multimedia traffic components there might be significant periods of time 
where the audio channel was quiet. By contrast in an audio-only conference the channel is 
unlikely to be quiet for any lengthy period of time. 
When a reservation teardown arrives at an on-tree incoming router interface
it is copied to each on-tree outgoing interface where the following steps are 
applied: 



If reservation mode is SS 

• Remove reservation 

Else if total number of installed SSR reservations including this one is greater than 
one 

• Remove reservation. 

Else 

• Set reservation’s status flag to 0 (passive).
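
The reservation and teardown rules above can be sketched in Python as follows. This is a minimal sketch for a single session on a single on-tree outgoing interface: the class names, the use of a plain sender identifier in place of a full filter spec, and the choice to ignore a teardown from an unknown sender are all assumptions made for illustration.

```python
from dataclasses import dataclass

SS, SSR = 0, 1          # mode flag values used in the rules above
PASSIVE, ACTIVE = 0, 1  # status flag values used in the rules above


@dataclass
class Reservation:
    sender: str   # stands in for the filter spec
    level: float
    mode: int
    status: int = ACTIVE


class OutgoingInterface:
    """Reservation state for one session on one on-tree outgoing interface."""

    def __init__(self):
        self.reservations = {}  # sender -> Reservation

    def on_res(self, sender, level, mode):
        resv = self.reservations.get(sender)
        if resv is not None:  # a reservation for that sender already exists
            resv.mode, resv.level, resv.status = mode, level, ACTIVE
            return
        passive = next((x for x in self.reservations.values()
                        if x.mode == SSR and x.status == PASSIVE), None)
        if mode == SSR and passive is not None:
            # Hand the passive residue over to the new sender.
            del self.reservations[passive.sender]
            passive.sender, passive.level, passive.status = sender, level, ACTIVE
            self.reservations[sender] = passive
            return
        self.reservations[sender] = Reservation(sender, level, mode)

    def on_teardown(self, sender):
        resv = self.reservations.get(sender)
        if resv is None:
            return  # assumption: nothing installed for this sender, ignore
        ssr_count = sum(1 for x in self.reservations.values() if x.mode == SSR)
        if resv.mode == SS or ssr_count > 1:
            del self.reservations[sender]
        else:
            resv.status = PASSIVE  # last SSR reservation stays as residue


iface = OutgoingInterface()
iface.on_res("A", 100.0, SSR)
iface.on_teardown("A")         # only SSR reservation: kept, marked passive
iface.on_res("B", 150.0, SSR)  # reuses the residue left by A
print(sorted(iface.reservations), iface.reservations["B"].level)
```

Keeping the last SSR reservation as passive residue is what lets the next sender's request start from an existing reservation rather than from nothing.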




Figure 5: Traffic Pattern for a shared session with some sender transmission 
overlap 




(Figure 6 panels: ‘Simple Shared Reservation’; Reservation Using DRP in SSR Mode)

Figure 6: Bandwidth reserved for a shared session with some sender transmission 
overlap. 



3.6 Using feedback to increase the probability of achieving end-to-end
delay bound



In the case of Guaranteed Service reservations the adoption of a strategy whereby 
an end-to-end reservation is only permissible by installing an equal reservation in 
each router reduces the chances of meeting a target end-to-end delay bound. This 
characteristic has been noted by the designers of Guaranteed Service and exploited 
to a reasonable degree through the introduction of a slack term (White, 1997) into 
the reservation flow specification. Use of the slack term enables higher 
reservations to be made between the receiver and the bottleneck router to 
compensate for the increase in delay incurred by the lower reservation in all routers 
between, and including, the bottleneck router and the sender. However it does not 
permit the reservation to be increased once it has passed through the bottleneck 
router on its way towards the sender. Such a restriction can sometimes prevent
RSVP from achieving the target delay bound even on a path that actually contains 
enough resources to meet the target end-to-end delay bound. 

Unlike RSVP, DRP employs cooperation and feedback between routers to 
ensure that if a given end-to-end path is capable of supporting a specific target 
delay bound then DRP will always meet the target delay bound. In the example of 
Figure 7 each router is able to reserve a bandwidth in excess of the mean rate of the 
sender’s traffic but at router r3 the accumulated delay exceeds, for the first time,
the target delay bound, Dt. Consequently, router r3 sets a bottleneck flag associated
with its local reservation as well as setting bottleneck and delayvoid flags in the
forwarded RES message. When r4 receives the RES message it notices that the
delayvoid flag has been set to 1 and so as a result reserves the maximum
reservation that it can (subject to any installed policy decisions) in order to
minimise its contribution to the accumulated delay. When R receives the RES
message it notices that the target delay bound has been exceeded and immediately 
issues a RTN packet containing the amount by which the target delay has been 
exceeded in a field in the packet known as the excess delay field. In addition, a bit 
in the packet known as the bottleneck bit, is set to 0. This RTN packet is reverse- 
routed up the tree but is ignored at each router until its bottleneck flag has been set 
to 1 which will occur when it reaches the interface of a router, r3 in this example, 
in which the bottleneck flag has been set to 1 for the installed reservation. The 
RTN packet will then travel hop by hop towards S with an attempt being made at 
each hop to eliminate the excess delay or at least reduce it as much as possible by 
increasing the level of the local reservation on the appropriate outgoing interface. 
If a router succeeds in reducing the excess delay to zero then the RTN packet will 
cause no further alterations in local reservations on the rest of its journey towards 
S. In this example, r2 is able to increase its local reservation and cause a reduction
in its local queuing delay of del which is then subtracted from the excess delay
field before sending the RTN packet to r1 which manages to increase its local
reservation sufficiently to completely eliminate the excess delay. The target
end-to-end delay bound has now been achieved.
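
The upstream pass of the RTN packet can be sketched as follows. The function name, the dictionary keys and the integer millisecond values are hypothetical; the sketch also assumes that the bottleneck router itself is allowed to attempt a reduction once the RTN bottleneck bit has been set, a detail the description above does not settle.

```python
def propagate_rtn(routers_upstream, excess_delay):
    """Walk an RTN packet from the receiver towards the sender.

    routers_upstream: list of dicts ordered receiver -> sender, each with
      'name', 'bottleneck' (flag on the installed reservation) and
      'max_reduction' (delay the router could remove by reserving more).
    Returns (remaining excess delay, [(router name, reduction applied)]).
    """
    rtn_bottleneck_bit = 0  # the receiver issues the RTN with this bit at 0
    applied = []
    for router in routers_upstream:
        if router["bottleneck"]:
            rtn_bottleneck_bit = 1
        if not rtn_bottleneck_bit or excess_delay <= 0:
            continue  # RTN is ignored here, or nothing is left to recover
        cut = min(router["max_reduction"], excess_delay)
        if cut > 0:
            applied.append((router["name"], cut))
            excess_delay -= cut
    return excess_delay, applied


# Path in the spirit of Figure 7, delays in milliseconds (invented values).
path = [
    {"name": "r4", "bottleneck": False, "max_reduction": 0},
    {"name": "r3", "bottleneck": True, "max_reduction": 0},
    {"name": "r2", "bottleneck": False, "max_reduction": 20},
    {"name": "r1", "bottleneck": False, "max_reduction": 50},
]
print(propagate_rtn(path, excess_delay=30))  # (0, [('r2', 20), ('r1', 10)])
```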



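(Figure 7 labels. Accumulated delay bound carried in RES packets from sender to
receiver: RES(d1<Dt), RES(d2<Dt), RES(d3<Dt), RES(d4>Dt), RES(d5>Dt), with
‘Set bottleneck flag=1’ marked between RES(d3<Dt) and RES(d4>Dt). Excess delay
carried in RTN packets, as laid out in the figure: RTN(0), RTN(d5-Dt-del),
RTN(d5-Dt), RTN(d5-Dt), RTN(d5-Dt), RTN(d5-Dt).)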



Figure 7: Use of DRP feedback mechanism to maximise chances of meeting target 
delay bound for Guaranteed Service. 



In the next two sections we present details of the main fields required in RES and 
RTN packets together with the processing rules in order to provide the basic 
functionality of DRP described in the previous sections. 
4 RESERVATION (RES) MESSAGE

The IP destination address of the IP datagram encapsulating a RES message is
equal to the session destination address while the IP source address is equal to the
initial sender of the RES packet. The IP router alert option is used to ensure that
intermediate nodes intercept and process the RES packets.

4.1 RES message Common Part 

• Session - (object defined in the RSVP protocol) - it contains the destination 
address, transport layer protocol identifier and transport layer destination port. 

• Phop - (object defined in the RSVP protocol) - it is the identity of the last 
DRP-capable logical outgoing interface to forward this message. The Phop 
object consists of the pair (IP address, logical interface handle) and is required 
to install Phop state in the router to ensure correct reverse routing of RTN 
messages. 

• Sender Template - (object defined in the RSVP protocol) - it is a filter
specification identifying the sender. It contains the IP address of the sender
and optionally the sender port (in the case of IPv6 a flow label may be used in
place of the sender port).

• timestamp field - this is stamped with the time of the local node clock just 
before being forwarded to next hop(s) down the distribution tree. It is used to 
calculate dnext as described in (White, 1998). 

• CRTs field (2 bits) - this identifies the ceiling reservation class of the sender.
11 indicates Guaranteed Service, 10 indicates Controlled-Load Service, and 00
indicates best-effort. 01 is currently unspecified, although it may at some time be
used for a new service with quality between best-effort and Controlled-Load
Service.

• Tspec - this describes the sender’s traffic characteristics using the following
token bucket representation, as described in (Schenker, 1997):

p = peak rate of flow (bytes/second)
b = bucket depth (bytes)
r = token bucket rate (bytes/second)
m = minimum policed unit (bytes)
M = maximum datagram size (bytes)

• end2end delay field - this gives the current delay from when a packet was 
transmitted by the initial sender until it is due to arrive at the incoming 
interface of the current next hop. 

• Mode field (1 bit) - this identifies the reservation mode. A value of 0 indicates
SS mode while a value of 1 indicates SSR mode.

• QoSvoid bit - if set to 1 this indicates that no QoS guarantees can be offered. 

4.2 RES message Guaranteed Service object

If CRTs = 11 (Guaranteed Service) the RES packet will also contain a Guaranteed 
Service object comprising the following: 

• CSum - accumulation of C values since last upstream reshaping point (see 
(Schenker, 1997)). 

• DSum - accumulation of D values since last upstream reshaping point (see 
(Schenker, 1997)). 

• target-bound field which indicates the target end-to-end delay of the sending 
application. 

• accumulated-bound field which indicates the installed delay bound between 
sender and the incoming interface of the current next hop. 

• Flags field containing 

• delayvoid bit (If set, this bit is an indication to the receiver that the target 
delay bound cannot be guaranteed) 

• lossvoid bit (If set, this bit is an indication to the receiver that a loss-free 
service cannot be guaranteed) 

4.3 Node Processing of RES messages 

When a node receives a RES packet for an end-to-end reservation attempt in which
a QoS violation has already occurred, or in which one occurs as a result of
processing the RES packet, the behaviour of the node is as described in sections
3.3 and 3.6. Otherwise the processing of the packet is as described in the
remainder of this section.

Upon receipt of the RES message the node passes it to admission control, which
then determines the reservation that needs to be made at each of the outgoing
interfaces. The reservation class is given by MIN(CRTs, CRTr), where CRTr is
obtained from the Merged RTN State Entry (MRTNSE) for the appropriate
outgoing logical interface as described in the next section.
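
Because the 2-bit class codes are ordered by service quality, taking MIN over the numeric codes selects the weaker of the two ceilings. A minimal sketch follows; the constant and function names are illustrative and not part of the protocol definition.

```python
# 2-bit ceiling reservation class codes as listed for the CRTs field.
GUARANTEED = 0b11
CONTROLLED_LOAD = 0b10
UNSPECIFIED = 0b01   # reserved for a possible future service
BEST_EFFORT = 0b00


def reservation_class(crt_s, crt_r):
    """Class to install on an outgoing interface: MIN(CRTs, CRTr)."""
    return min(crt_s, crt_r)
```

For instance, a Guaranteed Service sender combined with a Controlled-Load ceiling in the MRTNSE yields a Controlled-Load reservation on that interface.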

If the reservation request is for the Controlled-Load Service then the 
reservation is governed entirely by the sender Tspec contained within the RES 
message. By contrast, if the reservation request is for Guaranteed Service then the
reservation is described by the combination of the sender Tspec and a reservation
bandwidth, R, that the admission control mechanism needs to determine using the
following equations, as given in (Schenker, 1997).

$$Qdelay_{end2end} = \frac{(b - M)(p - R)}{R\,(p - r)} + \frac{M + C_{tot}}{R} + D_{tot} \quad (\text{case } p > R \ge r) \qquad (1)$$

$$Qdelay_{end2end} = \frac{M + C_{tot}}{R} + D_{tot} \quad (\text{case } R \ge p \ge r) \qquad (2)$$

Admission control obtains the parameters M, p, b and r from the sender Tspec
contained in the RES message. The value of Ctot is given by the sum of the
router’s local C value and the merged Ctot value obtained from the
MRTNSE (see next section) for the relevant outgoing interface. Likewise, the value
of Dtot in the above equations is given by the sum of the router’s local D value and
the Dtot value in the MRTNSE for the relevant outgoing interface. To obtain the
value of Qdelay to insert into the above equations, admission control uses the
relationship given in equation (3).

Qdelay = target-bound - accumulated-bound - dnext - propdelay. (3) 

where the target-bound and accumulated-bound are obtained from the 
corresponding fields in the RES packet, and propdelay and dnext are obtained from 
the MRTNSE for the outgoing interface. 

Once the resultant value of Qdelay has been substituted into equations (1) and 
(2) along with the other mentioned parameters, a value of R to be installed at the 
outgoing interface is obtained. 
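
As an illustration of how equations (1) to (3) combine, the following sketch solves for R by rearranging equation (1) or (2) for the given delay budget. The function names are hypothetical, and the decision to fall back to the token rate r when a smaller rate would satisfy the bound is an assumption based on the stated condition R >= r.

```python
def queueing_budget(target_bound, accumulated_bound, dnext, propdelay):
    """Equation (3): delay budget left for queueing from this node onwards."""
    return target_bound - accumulated_bound - dnext - propdelay


def delay_bound(R, p, b, r, M, ctot, dtot):
    """Equations (1) and (2): queueing delay bound for a reserved rate R >= r."""
    if R >= p:
        return (M + ctot) / R + dtot
    return (b - M) * (p - R) / (R * (p - r)) + (M + ctot) / R + dtot


def required_rate(qdelay, p, b, r, M, ctot, dtot):
    """Smallest rate R >= r whose bound does not exceed qdelay, or None."""
    budget = qdelay - dtot
    if budget <= 0:
        return None                      # no finite rate can meet the bound
    rate = (M + ctot) / budget           # equation (2), valid when rate >= p
    if rate < p and p > r:
        k = (b - M) / (p - r)
        rate = (p * k + M + ctot) / (budget + k)   # equation (1) solved for R
    return max(rate, r)
```

With p = 200000, b = 10000, r = 100000, M = 1000, Ctot = 1000, Dtot = 0.01 and Qdelay = 0.06, the sketch returns a rate of roughly 142857 bytes/second, which lies between r and p and reproduces the 0.06 second bound when substituted back into equation (1).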

Regardless of the reservation class, that is Controlled-Load or Guaranteed
Service, if processing of the RES does not result in either the installation of a new
reservation or a modification of an existing reservation (i.e. the RES packet was
simply a refresh) then the soft-state timer for the reservation is simply reset.
Otherwise, the reservation request is propagated immediately down the distribution
tree after updating the appropriate fields in the packet’s header as follows. The
end2end delay field in the RES packet is increased by adding to it the following:

• The propagation delay, dnext, for the next hop.

• An estimate of the current local queueing delay (for the relevant outgoing
interface) for data packets of the flow to which the RES packet refers.

In addition, if CRTs = 11 the following updates must be made to the RES packet:

• Add the following to the accumulated-bound field of the copied packet:

1. The propagation delay, dnext, for the next hop.

2. The installed local queueing delay bound (for the relevant outgoing
interface) for data packets of the flow to which the RES packet refers. This
local queueing delay bound is obtained by inserting the reserved value of R
into equation (4) along with the local values of C and D.

$$Q_{local} = \frac{C}{R} + D \qquad (4)$$

• If reshaping to the sender Tspec is being performed at the outgoing interface,
set CSum = DSum = 0. Otherwise:

Add the local value of C to the CSum field.
Add the local value of D to the DSum field.
Add the local value of D to the DSum field 

Once updating of the fields is complete, the timestamp field is set equal to the
local clock before the RES packet is forwarded to each next hop down the routing
tree.
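
The field updates applied before a RES packet is forwarded can be summarised in a short sketch. The ResFields record and the forward_res function are hypothetical names that cover only the fields discussed above, and the local queueing delay estimate is taken as an input because the text does not say how it is measured.

```python
from dataclasses import dataclass, replace


@dataclass
class ResFields:
    end2end_delay: float      # current end-to-end delay estimate (seconds)
    accumulated_bound: float  # installed delay bound so far (seconds)
    csum: float               # C accumulated since the last reshaping point
    dsum: float               # D accumulated since the last reshaping point
    timestamp: float


def forward_res(res, dnext, est_queue_delay, R, C, D, reshaping, now,
                guaranteed=True):
    """Return the copy of the RES fields to send to one next hop."""
    out = replace(res)  # work on a copy; one copy is sent per next hop
    out.end2end_delay += dnext + est_queue_delay
    if guaranteed:  # CRTs = 11
        q_local = C / R + D  # equation (4)
        out.accumulated_bound += dnext + q_local
        if reshaping:
            out.csum = out.dsum = 0.0
        else:
            out.csum += C
            out.dsum += D
    out.timestamp = now  # stamped last, just before forwarding
    return out
```

Keeping the update in a function that returns a fresh copy reflects the fact that a separate RES packet, with its own dnext and local delay terms, is sent to each next hop.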



5 RETURN (RTN) MESSAGE

The IP destination address of the IP datagram encapsulating an RTN message is
equal to the IP address of a previous hop node, the identity of which is obtained
from installed Phop state obtained from RES messages, while the IP source address
is equal to the IP address of the node out of which the RTN message was sent.

5.1 Common part 

• Session - as for RES message. 

• Nhop - (object defined in the RSVP protocol) - the identity of the DRP-capable
logical outgoing interface that sent this message. The Nhop object
consists of the pair (IP address, logical interface handle).

• Sender address - the combination of this field and the session object identify 
a source-based tree. In the case of a shared tree this field is ignored and should 
be set to all 0’s. 

• timestamp - stamped with the time of the local node clock just before being
sent to the previous hop up the distribution tree. This is used in the calculation
of dnext as described in (White, 1998).

• CRTr (2 bits) - indicates the receiver’s ceiling reservation class.

• timedelta - used in the calculation of dnext as described in (White, 1998).

• propdelay - the data packet propagation delay along the maximum ‘Total
Rate-Independent Delay’ (TRID) path (footnote 14) between the node incoming
(footnote 15) interface out of which the RTN packet was sent and each receiver
downstream.

• pathMTU - the minimum pathMTU value between the incoming interface out 
of which the RTN packet was sent and each receiver downstream of that 
incoming interface. 

• Ctot - the maximum accumulated Ctot value along the paths between the 
incoming interface out of which the RTN packet was sent and each receiver 
downstream of that incoming interface. The C error term is defined in the 
Guaranteed Service specification (Schenker, 1997). 



14 Total Rate-Independent Delay (TRID) is given by the sum of the link propagation delays
and the D error terms.

15 The term ‘incoming’ refers to the direction of data flow. RTN packets are reverse-routed 
up the distribution tree in the opposite direction to the data flow and so are always sent out 
of so-called incoming interfaces. 

• Dtot - sum of D error terms along the maximum ‘Total Rate-Independent
Delay’ (TRID) path (footnote 14) between the node incoming (footnote 16)
interface out of which the RTN packet was sent and each receiver downstream.
The D error term is defined in the Guaranteed Service specification
(Schenker, 1997).

• path bandwidth - the maximum path bandwidth value along the paths 
between the incoming interface out of which the RTN packet was sent and 
each receiver downstream of that incoming interface. 

5.2 RTN Guaranteed Service feedback object 

The RTN packet may optionally contain a Guaranteed Service feedback object 
comprising: 

• excess delay field - the amount by which the installed end-to-end delay bound 
currently exceeds the target end-to-end delay bound. 

• bottleneck flag - if set to 1 this indicates that the RTN message has travelled
at least as far as the router where the accumulated delay-bound first
exceeded the target delay-bound on the first pass of the RES message.

• Sender Template - same as that in RES packet whose end-to-end delay 
bound was exceeded. 

5.3 RTN state and message merging rules 

At an outgoing interface, i, of a router on the distribution tree, reception of an RTN
packet from a next hop, j, results in the updating of any matching router state,
known as an RTN state entry or RTNSE_ij, or the setup of new state if no match
exists. There will be a separate RTNSE_ij for each 4-tuple (Session, sender address,
next hop, outgoing logical interface). The first three parameters of this 4-tuple are
contained within the received RTN message, while the outgoing logical
interface (oif) is determined by the interface on which the RTN message arrived. In
the case of a shared tree the sender address field will be omitted for the RTNSE_ij.
The format of an RTNSE_ij (excluding any guaranteed service feedback parameters)
is as shown in Table 2. In addition, for each outgoing logical interface, i, a single
Merged RTN State Entry (MRTNSE_i) is created from the set of entries {RTNSE_ij}
for that logical outgoing interface. There will be multiple RTNSE_ij entries for a given
logical outgoing interface if the logical outgoing interface has multiple next hops
on the distribution tree, which can occur if the logical outgoing interface connects
to a shared-medium LAN (e.g. Ethernet). The parameters of the MRTNSE_i and how
they are formed from {RTNSE_ij} are also shown in Table 2. The ‘merged values’
of various parameters in each RTN message sent out of an incoming interface to a
previous hop upstream are obtained from {MRTNSE_i}, the set of MRTNSE for the
outgoing interfaces, as shown in Table 2.

16 The term ‘incoming’ refers to the direction of data flow. RTN packets are reverse-routed 
up the distribution tree in the opposite direction to the data flow and so are always sent out 
of so-called incoming interfaces. 

RTNSE_ij | MRTNSE_i | Merged RTN packet sent upstream out of interface k
------------------------------------------------------------------------------
CRT_ij | CRT_i = MAX_j{CRT_ij} | CRT_k = MAX_i{CRT_i}
Ctot_ij | Ctot_i = MAX_j{Ctot_ij} | Ctot_k = MAX_i{Ctot_i + Clocal_ki} (footnote 17)
Dtot_ij | Dtot_i = Dtot_ij, where j is such that TRID_ij = TRID_i | Dtot_k = Dtot_i + Dlocal_ki, such that i gives MAX{Dlocal_ki + TRID_i} for that interface k (footnote 17)
Propdelay_ij | Propdelay_i = Propdelay_ij, where j is such that TRID_ij = TRID_i | Propdelay_k = Propdelay_i + dnext_i, such that i gives MAX{Dlocal_ki + TRID_i} for that interface k (footnote 17)
PathBandwidth_ij | PathBandwidth_i = MAX_j{PathBandwidth_ij} | PathBandwidth_k = MAX_i{MIN(PathBandwidth_i, linkRate_i)}
PathMTU_ij | PathMTU_i = MIN_j{PathMTU_ij} | PathMTU_k = MIN_i{MIN(PathMTU_i, linkMTU_i)}
dnext_ij | dnext_i = dnext_ij, where j is such that TRID_ij = TRID_i | (none)
TRID_ij = Dtot_ij + Propdelay_ij + dnext_ij | TRID_i = MAX_j{TRID_ij} | (none)
sender template s | sender template s | sender template s
excessDelay_ijs | excessDelay_is = MAX_j{excessDelay_ijs} | excessDelay_ks = MAX_i{excessDelay_is - delayReduction_is}
bottleneckFlag_ijs | bottleneckFlag_is = MAX_j{bottleneckFlag_ijs} | bottleneckFlag_ks = MAX_i{bottleneckFlag_is}


Table 2: relationship between RTN state entries, MRTN state entries and merged 
RTN packets. 



17 Clocal_ki, Dlocal_ki = the router’s values of the C and D error terms between incoming
interface k and outgoing interface i.

The last three rows of Table 2 represent optional GS-feedback objects and are
written in italics to differentiate them from the core entries shown in the table.
Merging between GS-feedback object state only occurs if the objects relate to the
same sender template, s. A merged GS-feedback object for sender template s is
only included in the merged RTN packet sent upstream if the RTN packet is
addressed to the Phop for sender template s as obtained from installed RES state.
With regard to the excess delay entries shown in the table, delayReduction_is refers
to the local reservation queueing delay reduction achieved since the RES for sender
s at interface i was installed.
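
The core merging rules for a single outgoing interface (the MRTNSE column of Table 2) can be sketched as follows. The RtnState record is a hypothetical container for the core parameters only; the GS-feedback rows and the second merging step across outgoing interfaces are not covered.

```python
from dataclasses import dataclass


@dataclass
class RtnState:
    crt: int
    ctot: float
    dtot: float
    propdelay: float
    path_bandwidth: float
    path_mtu: int
    dnext: float

    @property
    def trid(self):
        # Total Rate-Independent Delay of this entry, as defined in Table 2.
        return self.dtot + self.propdelay + self.dnext


def merge_interface(entries):
    """Build the merged entry for one outgoing interface from its RTNSEs."""
    # Dtot, propdelay and dnext are all taken from the entry with maximum TRID.
    worst = max(entries, key=lambda e: e.trid)
    return RtnState(
        crt=max(e.crt for e in entries),
        ctot=max(e.ctot for e in entries),
        dtot=worst.dtot,
        propdelay=worst.propdelay,
        path_bandwidth=max(e.path_bandwidth for e in entries),
        path_mtu=min(e.path_mtu for e in entries),
        dnext=worst.dnext,
    )
```

Taking Dtot, propdelay and dnext from the same maximum-TRID entry, rather than maximising each independently, keeps the three values consistent with one real downstream path.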

If CRTr is not equal to GS in the propagated RTN message, the rules in Table 2 
are overridden by setting Ctot=Dtot=propdelay=0 in order to ensure that only those 
links receiving Guaranteed Service are taken into account when conducting worst- 
case merging of GS-specific parameters. 

Whenever the contents of an RTN message to be sent upstream differ from those of
the preceding one, the RTN message is sent immediately. Otherwise, i.e. in the
steady state, an RTN message is sent to a previous hop once per refresh period.

6 SUMMARY 

In this paper we have discussed the need for resource reservation in the Internet 
and examined the use of RSVP for this purpose while highlighting some of its 
favourable characteristics such as its use of ‘soft-state’ reservations. Consequently 
we acknowledge RSVP as a useful starting point in the design of alternative 
reservation protocols but we do not accept that it represents the ultimate solution 
because of certain deficiencies and restrictions that we demonstrated in the text. 
This has motivated our design of an alternative IP reservation protocol, DRP which 
incorporates many principles of RSVP together with the dynamic sender-initiated 
reservation concept of ABT/IT to achieve the following main goals: 

1. High reservation control dynamics to achieve efficient bandwidth usage.

2. Scalability of router-state with regard to number of senders and receivers. The 
protocol is especially suited to large-scale-multicast applications where it can 
expect to achieve a router state saving of several orders of magnitude 
compared to RSVP. 

3. Heterogeneity of QoS classes and reservation styles for nodes within a given 
multicast session. 

Details of control messages were presented along with associated processing rules.
Although in principle DRP offers considerable benefits over existing reservation
protocols, certain aspects of it are not well understood and further work is required,
especially in the following areas:

1. Reservation setup time for each of the different service classes.

2. Impact of SSR mode on reservation set up time compared to SS mode. 

3. Effect of worst-case merging of OPWA data - For a large multicast tree this
will tend to cause the nodes closest to the sender to over-estimate their local
reservations, which in turn causes a reduction in the local reservations
downstream. Any implications of this phenomenon need to be clarified.

4. Investigation into alternative Guaranteed Service feedback techniques for the
purpose of reducing the end-to-end delay bound when it is in excess of the
target delay bound after one pass of the RES packet. For example, one
alternative worth investigating is the generation of the RTN packet containing
the Guaranteed Service feedback object as soon as the bottleneck node is
encountered rather than waiting until the RES packet arrives at the receiver.

Abstract

In this paper, we present a differentiated service scheme called “User-Share 
Differentiation (USD)”. The USD scheme is designed for long-term bandwidth 
allocation without per-session signaling. The scheme allows ISPs to provide traffic 
isolation on a per-user basis and guarantee proportional fairness. We first look at 
the background for differentiated services, and the problems with the current 
proposals. We then present the details of the USD scheme and examine the 
implementation and deployment issues. 

Keywords 

Quality of service, differentiated service, weighted fair queuing, bandwidth
allocation, proportional fairness



INTRODUCTION 

The current Internet is built on the best effort model, where all packets are treated
as independent datagrams and are serviced on a FIFO basis. The best effort
model does not provide any form of traffic isolation inside the network, and the
network resources are completely shared by all users. As a result, the Internet
suffers from the “Problem of Commons”, where greedy users try to grab as many
resources as possible. Such a system can become unstable and lead to congestion
collapse. The Internet currently still works because most end systems use TCP
congestion control mechanisms and back off during congestion. However, such
dependence on the end systems’ cooperation is becoming increasingly unrealistic.
Inevitably, people start to exploit the weakness of the best effort model to gain
more resources. An example of this is establishing multiple TCP connections in
web browsers to gain a greater share of the bandwidth. The best effort model also
prevents ISPs from meeting the different needs of their customers, since it is
difficult to allocate more resources to those who are willing to pay more.

The problems with the best effort model have long been recognized. For the last
couple of years, QoS provision has been one of the hottest areas in networking
research, and various aspects of the issue have been extensively studied including 
traffic analysis, admission control, resource reservation, scheduling, QoS routing, 
and operating system support. The architectures of various proposed solutions 
differ in details. Nevertheless, the underlying model is rather similar. Essentially, 
applications make resource reservation on an end-to-end per session basis. We 
refer to this model as the End-to-End Per Session (EEPS) model. In the Internet 
community, RSVP and INT-SERV are examples of protocols and service models 
based on this model (Zhang et al 1993). In general, the EEPS model achieves QoS 
guarantees through the following steps: 

• The application characterizes its traffic, and describes its requirements in a 
flow specification. 

• The QoS routing figures out one or more candidate paths based on the 
requirements. 

• A reservation or signaling protocol then performs admission control hop by
hop and installs the reservation over the candidate path if there are sufficient
resources.

• The schedulers enforce the reservation for each flow.

The past work based on the EEPS model has given us valuable insights and 
practical experience with resource allocation in the Internet. However, the EEPS 
model has a number of problems: 

• On-demand per-session reservation does not work well for Web-based
applications. For applications with long-lasting sessions, such as video
conferencing, the delay and overheads of reservation are minimal. But for
transaction-based applications such as the Web, where a user can go through
many destinations in a few seconds, setting up a reservation for each
transaction has a high overhead. Furthermore, the resource requirement for
Web traffic is usually difficult to determine. Very often, a user does not know
how big an object is before fetching it. Also, delay variation affects Web-based
applications far less drastically than applications like video
conferencing, so there is more room for adaptation.

• Security, accounting and administrative support represent a significant
overhead in resource reservation. As each reservation is a service contract
between the user and the ISPs along the path, resource reservation goes far
beyond simply installing state inside the network. Each request has to be
authenticated and the user’s account has to be charged. Within the user
organization, there may be internal procedures for approval and coordination
of requests from individual users. These have two implications. First,
accounting and administrative support has to be an integrated part of a
resource reservation system before the system can be deployed. Second, there
is a need for aggregated reservation in order to reduce the overheads of resource
reservation for short-lived sessions.

• The inter-ISP settlement is a complex issue and is unlikely to be resolved in
the near term. When a path traverses multiple ISPs, an end-to-end reservation
requires an agreement among all major ISPs on the inter-ISP settlement. Any
single ISP that does not participate in such an agreement may break the
reservation. Incremental deployment measures are essential for the success
of any reservation system.

• The fine granularity of the EEPS model can lead to some scalability problems
as the number of reservations increases. The classifier has to check the five
fields (source address, destination address, source port, destination port and 
protocol) to determine if a packet belongs to one of the reserved flows. Such 
fine granularity lookup can be expensive when the number of flows is large. 
When the sessions are short-lived, the control messages can also be substantial 
and the processing of the control messages may become the bottleneck. 

When work on the EEPS model started a couple of years ago, real-time
applications such as video conferencing were regarded as the mainstream
application for the future Internet. For such applications, on-demand per-session
reservation makes sense. However, the advent of the Web has changed the
landscape significantly. The majority of Internet traffic today is Web-based and
tends to be short-lived and transaction-oriented.

In recent months, the term “differentiated services” has been used to describe
new service models and mechanisms to achieve bandwidth allocation for 
aggregated traffic without per-session reservation. The basic requirements for such 
new service models and mechanisms are as follows: 

• Aggregated bandwidth allocation without the need for per-session signalling. 

• Long-term service contracts within a single domain. 

• Integrated and simplified accounting. 

• Better traffic isolation for performance predictability. 

• Better services to users who are willing to pay more. 

In this paper, we present a scalable differentiated service scheme called “User- 
Share Differentiation (USD)” (Wang 1997). The USD scheme is designed for 
long-term bandwidth allocation without per-session signaling. The scheme allows 
ISPs to provide traffic isolation on a per-user basis and guarantee proportional 
fairness. We first look at the background for differentiated services, and the 
problems with the current proposals. We then present the details of the USD 
scheme and examine the implementation and deployment issues. 



RELATED PROPOSALS 

In this section, we examine two related proposals that have been put forward for 
differentiated services. 

Premium Service

The premium service model described by Nichols et al (1997) provides guaranteed
peak-rate bandwidth for aggregated traffic flows from users at the ISP entry points.
The proposal creates a “premium” service that is provisioned according to the 
worst-case requirements and guaranteed by priority queuing. Routers at the edges 
of the network filter packets and set the premium bit in the packet header 
according to users’ premium bandwidth profile. Inside the network, packets with 
the premium bit set are transmitted prior to the best effort packets. 

The premium service proposal represents an extreme form of resource allocation, 
where the network capacity is effectively reduced by the amount allocated to the 
premium class traffic and the best effort traffic suffers all the consequences of 
congestion. The premium class requires strict admission control at the entry points 
to the ISP. Users’ traffic is shaped to the allocated bandwidth. The arriving packets 
that exceed the allocated bandwidth profile are either delayed or dropped. 

Assured Service 

The assured service defined in the profile-based tagging scheme uses drop priority
to differentiate traffic (Clark and Wroclawski 1997). Each user is assigned a
service profile that describes the “expected capacity” from the ISP. The traffic
from a user is checked by a profile meter at the entry points to the ISP. Packets that
are out of profile can still go through, but they are tagged as such by the profile
meter. When congestion occurs inside the network, the routers drop the tagged
packets first. When traffic is compliant with the agreed profile, it is expected that a
user can have a predictable level of service.

The premium service proposal and the profile-based tagging proposal are similar 
in that both proposals create an “upper” class and give preference to the upper 
class over the best effort class. In both proposals, the classification of packets is 
done at the edges and the class information is encoded in the packet header. The 
difference is that the premium service drops out-of-profile packets at
the entry points to the ISP, while the assured service still allows such packets to go
through in the hope that they may reach their destinations.
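
The difference in edge behaviour can be made concrete with a small sketch. The proposals cited above do not prescribe a particular meter, so the token bucket used here, together with the class and function names, is an assumption made purely for illustration; shaping (delaying) of premium traffic is left out.

```python
class ProfileMeter:
    """Token-bucket meter (assumed form of a user's bandwidth profile)."""

    def __init__(self, rate, depth):
        self.rate = rate        # profile rate, bytes/second
        self.depth = depth      # maximum burst, bytes
        self.tokens = depth
        self.last = 0.0

    def in_profile(self, now, size):
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


def premium_edge(meter, now, size):
    # Premium service: out-of-profile packets do not enter the network.
    return "forward-premium" if meter.in_profile(now, size) else "drop"


def assured_edge(meter, now, size):
    # Assured service: out-of-profile packets are admitted but tagged,
    # so that congested routers drop them first.
    return "forward" if meter.in_profile(now, size) else "forward-tagged"
```

Both edges share the same metering step; only the action taken on an out-of-profile packet differs.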

The two proposals attempt to push all policy-related processing to the edges of 
the network, and inside the core, the routers just forward packets based on the ToS 
bits in the header. While this approach has the advantage of simplicity, there are 
also a number of problems: 

• Provisioning. Both proposals attempt to provide guaranteed bandwidth to the
upper class traffic through admission control at the edges of the network. The
assumption here is that with admission control around the edges and proper
provisioning of the network, one can effectively eliminate congestion for the
upper class. However, given a set of upper class users, dimensioning the
network to meet the bandwidth guarantees to those users is a non-trivial
problem. Traffic flows are dynamic: any source can generate traffic to
different destinations at different rates, and the routes to the
destinations may also change. Thus it is difficult for the edges to have
knowledge of the traffic distribution inside the network. To provide any sort of
guarantees, it is necessary to provision the network for the worst case, where
one assumes that all upper class traffic may go through the weakest link in
the network.

• Choosing profile. Choosing a proper profile for a user is not a straightforward
task either. It can be viewed as the reverse problem of provisioning: for a
given network, how can one decide the best profile to assign to each user?
Note that a profile applies to the aggregated traffic flow from a user
organization going through the entry point to its ISP. When a network does not
have uniform bandwidth provisioning, profiles are likely to be
destination-specific. For example, suppose that an ISP has a link to a neighbor ISP with a
capacity of 600 Mbps and a link to the Internet backbone with a capacity of 100
Mbps. If a user is communicating with another user in the neighbor ISP, the
user can have a rate limit of 6 Mbps, but only 1 Mbps if the user’s traffic goes
through the backbone access link. In this case, a single profile of either 6 Mbps
or 1 Mbps is not appropriate for the user. When a user is sending traffic to
multiple destinations, the situation becomes even more complicated.

• Reverse traffic. Both the premium service and assured service proposals
largely focus on the case where the users send packets towards the ISP. The
bits in the packet header are set at the entry points to the ISP before the packets
are mixed with packets from other sources, so the profile meter only needs to know
the admission control policy for the sender. However, in many cases the users
are actually pulling traffic in from the ISP. A typical example of this is
Web-based applications, which usually retrieve information from Web servers.
To apply the premium service and assured service models to such reverse
traffic, the premium/assured bits have to be set at the server side or at the
ISP-ISP boundary. There are a number of problems. First, it implies that the profile
meter at an ISP-ISP boundary has to know the admission control policy for all
its users. Second, if there are multiple ISP-ISP boundaries, it becomes
necessary for the profile meters at all boundaries to cooperate in order to make
sure the sum of all upper class traffic for a user matches its profile.

• Starvation. With the premium class scheme, congestion is invisible to the
premium class, so the network no longer provides any congestion signal for
the premium class traffic. For TCP flows in the premium class, the sender’s
window will grow to the point that all bandwidth allocated to the premium
class is taken up. If the bandwidth provisioning for the premium class is not
done with care, best effort traffic will see significant degradation and may be
starved completely.

• The profile-based tagging only provides limited protection against
misbehaving sources. Since the network deals with the tagged packets in a
FIFO fashion in profile-based tagging, a misbehaving source can still gain
more bandwidth by injecting excessive traffic. The problem can be
aggravated when the fixed profiles are significantly over (or below) the level
appropriate for the congested links. In such a case, the majority of the packets
are not tagged (or tagged). Thus the tagging provides little information to
enforce differentiation, and network behavior will be close to simple FIFO
best effort.



USER-SHARE DIFFERENTIATION 

In this section, we present User-Share Differentiation (USD), a scalable bandwidth 
allocation scheme for differentiated services, and discuss the key design principles 
behind the scheme. 

3.1. Overview 

The USD scheme is designed to provide long-term bandwidth allocation for 
aggregated traffic flows. It differs from the premium service and assured service in 
a number of ways: 

• The USD scheme provides traffic isolation between the customers of an ISP 
rather than between two classes. 

• The primary service model in USD is proportional fairness rather than explicit 
bandwidth guarantee. 

• The USD scheme enforces bandwidth allocation on the bottlenecks where the 
congestion takes place. 

• The USD scheme does not require admission control at the edges of the 
network and it also works well with reverse traffic. 

The USD scheme has two important components: 

• User. A user is the basic entity to which the bandwidth is allocated. The term 
“user” refers to the party with whom an ISP enters a service contract and the 
entity that pays for the ISP’s service. It is important to note that a user is not 
necessarily an end user; it can be a network or a group of networks. 

• Share. Each user is assigned a number called a “share” based on how much 
the user has paid for the service. The share determines how much bandwidth 
the user is allocated. Under congestion, the share is used as the “weight” in 
the allocation of bandwidth. 

The USD bandwidth allocation is carried out in the following steps: 

• When a user subscribes to its ISP for Internet services, the ISP and the 
user agree on the share for the user based on the user’s requirements and 
how much it is willing to pay. The share may change each time the user 
changes its service contract. 

• At any time instant and at any point inside the ISP’s domain, the ISP can 
provide two guarantees to the user. First, a user will have a minimum amount 
of guaranteed bandwidth anywhere inside the ISP (the worst-case guarantee). 
Second, at any time instant, the amount of bandwidth allocated to a user is 
proportional to its share among all the active competing traffic on any link 
(proportional guarantee). 




Figure 1: User-Share-Differentiation Information Flows 



• The user and its corresponding share are distributed to some or all routers 
inside the ISP through a network management protocol such as SNMP or a 
similar protocol. 

• The USD allocation is policed with a scheduler that supports proportional fair 
sharing (PFS) and is activated whenever a queue builds up. 

We now discuss the key design decisions behind the USD scheme. 

3.2. Flexible Control Granularity 

One of the main issues in any resource allocation is the granularity of the control. 
The finer granularity offers better control of resources but also brings the 
associated complexity in the setup of the state and classification of packets. Part of 
the problem with the EEPS model is its 5-tuple fine granularity and the signaling 
requirement that comes with it. On the other hand, the granularity determines the 
minimum level of control one can exercise and the level of traffic isolation that 
can be supported. Therefore, it is an important engineering decision that has to 
be made. 

In USD, we follow the natural administrative boundaries in the customer-ISP 
contractual relationship and introduce the user as the basic unit that defines control 
granularity. All traffic originated from or destined to a user is aggregated into a 
single flow, and the ISP provides protection for a user’s traffic from other 
competing users. Within traffic of a single user, it is up to the user to decide how 
the bandwidth is used internally. The definition of user is flexible to allow variable 
granularity to meet different requirements. A user can be an individual host 
identified by its IP address, a network identified by its network prefix, or an ISP 
identified by its prefixes. 







As the per-user granularity provides full traffic isolation between users, it takes 
away the incentives for misbehaving. If a misbehaving user ignores the congestion 
signal, and continues to send traffic at unsustainable rates, it can only waste the 
bandwidth that the user is allocated and cause its own packets to be dropped. We 
believe that once traffic isolation is provided inside the network, users will start 
to deploy intelligent congestion control mechanisms for their own good. 

3.3. Scalable Aggregation 

The definition of user also determines the level of aggregation inside a network. 
Note that the Internet has a hierarchical structure of ISPs, from backbone ISPs to 
retail ISPs and to end users. USD follows the same structure and allows 
hierarchical aggregation. 

When traffic goes across ISP boundaries, the level of traffic aggregation 
changes accordingly. Within a user’s immediate ISP, with which the user has a 
direct contractual agreement, all traffic of the user is aggregated into a single 
flow. In the core backbone, the retail ISP has a contractual agreement with the 
backbone provider on behalf of all users of the retail ISP. Thus, all traffic from 
and to the ISP is visible in the backbone as one flow. As the traffic moves from the 
sender toward the backbone, the level of aggregation increases while the level of 
control granularity decreases. Such variable levels of aggregation are essential 
for scalability in the core backbone, as the amount of state in a network depends 
only on the number of customers with which an ISP has a direct contractual 
relationship. 

When a packet traverses the network, the control policy can actually change as 
the user-ISP relationship changes. Take Fig. 2 for example. When a packet from 
user A enters ISP A, the packet is aggregated into the flow to or from user A and 
the bandwidth allocation within ISP A is determined by the share assigned to user 
A (the source address prefix). When the packet goes into ISP B, the destination 
address prefix becomes visible, and thus the allocation depends on user B’s contract 
with ISP B. Within the backbone, the packet is aggregated into the flow for ISP A 
or ISP B. Such variable levels of aggregation ensure a great deal of scalability and 
are consistent with the contractual obligations of the parties along the path. 

Figure 2: Variable Levels of Aggregation 

3.4. Proportional Fair Sharing 

The per-user granularity allows traffic isolation between competing users. We can 
now move onto the policy for allocating bandwidth to multiple competing users. 

In a commercial Internet, it is natural that the bandwidth allocation must be 
linked to how much a user has paid. This way, users can quantify the amount of 
services they get and thus are willing to pay more for better services. There are two 
basic approaches to achieve this goal. One can provide a user with an explicit 
bandwidth guarantee. Such a guarantee would be easy to carry out if it is over a 
specific path, for example, a virtual leased line between two sites for a VPN 
application. However, guaranteeing bandwidth anywhere in an ISP is much harder, as 
one has to provision for the worst-case scenario. We believe that, for long-term 
bandwidth allocation anywhere in an ISP, guaranteeing relative fairness rather than 
an explicit amount of bandwidth is a more efficient option. In USD, the share 
reflects how big a slice of service a user has paid for. When congestion occurs, the 
USD scheme allocates bandwidth to all active users at the bottleneck in 
proportion to their shares. For example, if user A and user B have shares of 5 and 
10 respectively, user B will always get at least twice what user A gets. If a user 
is consuming less bandwidth than it is allocated, the spare bandwidth is allocated 
to the other backlogged users in proportion to their shares. We call such a sharing 
model “Proportional Fair Sharing”. As we discuss in the next section, such a sharing 
policy can be easily implemented with WFQ using the share as the weight. 
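
The sharing rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function name `proportional_fair_share` and the idea of passing each user's offered load as a `demands` mapping are assumptions made here. It computes a weighted max-min allocation, in which bandwidth left unused by a user is redistributed to the backlogged users in proportion to their shares.

```python
def proportional_fair_share(capacity, shares, demands):
    """Weighted max-min allocation (hypothetical helper, not from the paper).

    capacity: link bandwidth
    shares:   user -> share (the weight)
    demands:  user -> offered load; float("inf") means always backlogged
    """
    alloc = {user: 0.0 for user in shares}
    active = {user for user in shares if demands.get(user, 0.0) > 0}
    remaining = float(capacity)
    while active and remaining > 0:
        total_share = sum(shares[user] for user in active)
        # Each active user's proportional slice of what is still unallocated.
        slice_of = {user: remaining * shares[user] / total_share
                    for user in active}
        # Users that need no more than their slice are fully satisfied.
        satisfied = {user for user in active if demands[user] <= slice_of[user]}
        if not satisfied:
            # Everyone left is backlogged: split strictly by share.
            alloc.update(slice_of)
            break
        for user in satisfied:
            alloc[user] = demands[user]
            remaining -= demands[user]
        active -= satisfied
    return alloc
```

With shares of 5 and 10 on a 30 Mbps link and both users backlogged, the sketch gives user A 10 Mbps and user B 20 Mbps, matching the two-to-one ratio in the text.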

Note that in the worst-case, the proportional allocation yields the same result as 
explicit allocation. For example, suppose that an ISP has 4 users A, B, C and D, 
sharing an access link of 30 Mbps. The agreed allocation is 4 Mbps, 6 Mbps, 8 
Mbps and 12 Mbps respectively. This allocation can be described with the actual 
bandwidth, 4 Mbps, 6 Mbps, 8 Mbps and 12 Mbps. Alternatively, the allocation 
can also be expressed with relative sharing, 2:3:4:6. When the 4 users are all active 
over the link, the bandwidth allocation is the same. However, the relative sharing 
has a number of advantages. First of all, it can guarantee the same minimum 
bandwidth allocation as an explicit allocation does. Second, it allows the 
bandwidth above the minimum to be shared in proportion to the minimum 
allocation. For example, suppose that user A and B in the previous example are not 
using their allocated bandwidth during a period. Now user C and D can share the 
extra bandwidth in proportion to their relative ratio. The final allocation to user C 
and D becomes 12 Mbps and 18 Mbps. More importantly, the relative sharing 
representation works well with multiple bottlenecks that have different bandwidth 
provisioning. For example, suppose the ISP of the 4 users has another link with 600 
Mbps bandwidth. The 4 users, who have shares of 2, 3, 4, and 6, will have their 
minimum guaranteed bandwidth automatically scaled up to 80 Mbps, 120 Mbps, 
160 Mbps and 240 Mbps respectively. 
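
The numbers in this example can be checked with a single formula. The notation is introduced here for illustration and is not the paper's: C is the link bandwidth, s_i is the share of user i, and 𝒜 is the set of users that are backlogged on the link. The formula assumes every user in 𝒜 can use its whole slice.

```latex
b_i = C \cdot \frac{s_i}{\sum_{j \in \mathcal{A}} s_j}, \qquad i \in \mathcal{A}

% All four users active on the 30 Mbps link (shares 2:3:4:6, sum 15):
b_A = 30 \cdot \tfrac{2}{15} = 4, \quad
b_B = 30 \cdot \tfrac{3}{15} = 6, \quad
b_C = 30 \cdot \tfrac{4}{15} = 8, \quad
b_D = 30 \cdot \tfrac{6}{15} = 12

% Only C and D active on the 30 Mbps link (shares 4:6, sum 10):
b_C = 30 \cdot \tfrac{4}{10} = 12, \quad
b_D = 30 \cdot \tfrac{6}{10} = 18

% All four users active on the 600 Mbps link:
b_A = 600 \cdot \tfrac{2}{15} = 80, \quad
b_B = 120, \quad b_C = 160, \quad b_D = 240
```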

The relative sharing can be viewed as a flexible profile, as it scales up and down 
according to the bandwidth available whilst guaranteeing the minimum bandwidth. 
In practice, the share can be defined in such a way that the share for a user can be 
easily derived from the minimum bandwidth allocated. For example, if we define 
the unit of share as 1 kbps, a user with 4 Mbps minimum bandwidth has a share of 
4000. 



IMPLEMENTATION AND DEPLOYMENT 



Fig. 3 shows the block diagram for the implementation of the USD scheme in a 
router. The bandwidth allocation unit is similar to the IP lookup unit in IP routers 
and consists of a bandwidth allocation table and a table lookup engine. The 
bandwidth allocation table is a list of user prefixes and their associated shares. 
The bandwidth allocation lookup engine does a longest prefix match on the source 
and destination addresses of each packet to see if there is a match in the bandwidth 
allocation table. If there is a match for either the source or the destination address, 
the associated share is used in the scheduler as the weight for bandwidth allocation. 
If both the source and destination match, both the sender and the receiver are 
within the ISP. In such cases, the minimum of the two shares is used for 
scheduling. 
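
The lookup just described can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `BandwidthAllocationTable` and its methods are hypothetical, the prefixes in the usage note are made-up documentation addresses, and the linear scan stands in for whatever trie or hardware lookup structure a real router would use.

```python
import ipaddress


class BandwidthAllocationTable:
    """Hypothetical table mapping user prefixes to shares."""

    def __init__(self):
        self._entries = []  # list of (network, share) pairs

    def add(self, prefix, share):
        self._entries.append((ipaddress.ip_network(prefix), share))

    def _longest_match(self, address):
        """Return the share of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        best = None
        for network, share in self._entries:
            if addr.version == network.version and addr in network:
                if best is None or network.prefixlen > best[0].prefixlen:
                    best = (network, share)
        return None if best is None else best[1]

    def weight_for(self, src, dst):
        """Scheduler weight for a packet: the matched share, or the
        minimum of the two shares when both addresses match."""
        matches = [share
                   for share in (self._longest_match(src),
                                 self._longest_match(dst))
                   if share is not None]
        return min(matches) if matches else None
```

For example, with entries `10.1.0.0/16` (share 4000) and `192.0.2.0/24` (share 6000), a packet from `10.1.9.9` to `192.0.2.7` matches on both addresses and is scheduled with weight 4000. A packet that matches neither prefix returns `None`; how such traffic is treated is not specified in the text.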



Figure 3: Block Diagram for the USD Implementation (blocks shown: Forwarding 
Table, Bandwidth Allocation Table) 

To support USD, routers need to implement a scheduler that supports proportional 
fair sharing. A wide range of scheduling algorithms meet this requirement. For 
example, Weighted Fair Queuing (WFQ) is one such algorithm that has been studied 
extensively in recent years (Parekh and Gallager 1992). Although the original WFQ 
is expensive to implement, several variations of WFQ have been proposed that 
support bandwidth sharing in a similar fashion but are optimized for software and 
hardware implementation (Bennett and Zhang 1996, Shreedhar and Varghese 1995, 
Stiliadis 1996). Some of the algorithms can emulate WFQ closely with O(1) 
complexity (Shreedhar and Varghese 1995). 
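
To make the idea of a share-weighted scheduler concrete, here is a simplified sketch in the spirit of deficit round robin, the algorithm of Shreedhar and Varghese. It is not the paper's scheduler: the class name, the `quantum_per_share` parameter and the per-round interface are assumptions made here, and for brevity the sketch visits every queue in each round instead of keeping the active list that gives the real algorithm its O(1) per-packet cost.

```python
from collections import deque


class DeficitRoundRobin:
    """Simplified share-weighted deficit round robin (illustrative only)."""

    def __init__(self, quantum_per_share=100):
        # Bytes of credit added per round for each unit of share (assumed).
        self.quantum_per_share = quantum_per_share
        self.queues = {}   # user -> deque of packet sizes in bytes
        self.shares = {}   # user -> share
        self.deficit = {}  # user -> unused credit in bytes

    def add_user(self, user, share):
        self.queues[user] = deque()
        self.shares[user] = share
        self.deficit[user] = 0

    def enqueue(self, user, size):
        self.queues[user].append(size)

    def run_round(self):
        """Serve each backlogged user once; return the (user, size) pairs sent."""
        sent = []
        for user, queue in self.queues.items():
            if not queue:
                self.deficit[user] = 0  # idle users do not accumulate credit
                continue
            self.deficit[user] += self.shares[user] * self.quantum_per_share
            while queue and queue[0] <= self.deficit[user]:
                size = queue.popleft()
                self.deficit[user] -= size
                sent.append((user, size))
            if not queue:
                self.deficit[user] = 0
        return sent
```

When two users with shares 5 and 10 are both backlogged with equal-size packets, the second user sends twice as many bytes per round as the first, which is the proportional behavior USD requires of its scheduler.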

USD enforces bandwidth sharing locally on the bottleneck links. Thus it does 
not require any changes to the end systems or any admission control at the user-ISP 
boundaries. Consequently, USD can be deployed in an incremental fashion. In 
fact, routers can be upgraded to support USD individually, and each upgrade gives 
an incremental improvement to the whole network. For example, when USD is 
installed on the router connected to the access link to the backbone, bandwidth 
allocation is enforced immediately for all traffic that is going through the access 
link. Moreover, USD only needs to be deployed at the points in the network that 
are heavily congested. Once bandwidth sharing is enforced at those points, other 
links may not require further policing. 



CONCLUSIONS 

In this paper, we examined the problems with the Premium service and Assured 
service proposals for the differentiated services, and presented the details of the 
USD scheme. We conclude that although the USD scheme requires more 
sophisticated support in the core network routers, it provides both minimum 
bandwidth guarantees and proportional fair sharing across the network and under 
various provisioning levels. 

Abstract 

The attempt to provide QoS in IP networks has raised some interesting questions on 
how a service can be provided to meet the application requirements while obeying 
the network resource constraints. Previous efforts focussed on Intserv, a flow-based, 
connection-oriented approach to delivering QoS for IP networks. This approach 
was quite comprehensive but it has not been widely deployed because of complexity 
and scalability issues. A recent packet-marking-based scheme called the 
Differentiated Services (Diffserv) Architecture provides a relatively simple and 
coarse approach. It is too early to predict the usefulness of this approach. This 
paper outlines a framework to deliver IP QoS which is based on Intserv. It addresses 
scalability concerns by removing the need for a connection-oriented reservation 
setup mechanism and replacing it with a Diffserv-like mechanism to consistently 
allocate bandwidth end-to-end in a network. A prototype device is discussed that 
manages bandwidth on a node. An algorithm is presented that allows the device to 
automatically detect application QoS requirements without the need for 
application-level signalling. A priority-based scheduling mechanism with a variant 
of weighted round-robin is described. 

1 INTRODUCTION 

The Internet has traditionally offered services in a best-effort manner. This is 
acceptable in an environment where congestion due to bandwidth demand is 
rare. Moreover, most of the traditional applications like email, file transfer and 
news are relatively insensitive to delay and delay variations. However, the recent 
surge in Internet popularity has put bandwidth at a premium. New applications 
like streaming video have high bandwidth requirements as well as delay 
constraints. Web access, real-audio etc. require a service which is better than 
best-effort. Adding more bandwidth is no longer a solution. Thus, a mechanism is 
needed for bandwidth sharing and prioritization. The idea of prioritizing traffic is 
also in tune with the approach of commercializing Internet applications, i.e., to get 
better service, one needs to pay more. 

A mechanism for bandwidth management is needed by which the finite available 
network resources can be shared among various applications in a manner suitable to 
the specific applications and also following the guidelines from the administrator. 

1.1 Providing QoS via Integrated Services (Intserv) Architecture 

Initial efforts on providing QoS for IP networks focussed on the Intserv (Integrated 
Services) model. This model relied on a flow-based, connection-oriented approach 
to deliver QoS for IP networks. An overview of Intserv is provided in RFC 1633 
[1]. In this RFC, Intserv extends the current IP architecture so that it provides QoS 
to end users. The extension includes: (i) a service model with two services and (ii) a 
reference implementation framework to provide support for QoS-enabled routers. 

The network element behaviour required to support the two services is outlined 
in: (i) RFC 2211: Controlled Load [2] and (ii) RFC 2212: Guaranteed Service [3]. 
The reference implementation framework provides implementation-level detail to 
realize the above two services. 

The framework [1] proposed to provide QoS support in routers includes the 
following four elements: (i) classifier, (ii) scheduler, (iii) admission controller, 
(iv) reservation setup protocol. RSVP is used as the reservation setup protocol of 
choice [4]. 



1.2 Providing QoS via Differentiated Services (Diffserv) Architecture 

The Intserv model is quite mature and much work has been completed. However, it 
has not been widely deployed due to a variety of concerns. A primary issue is that 
of scalability [6]. With Intserv, intermediate routers need to save per-flow state 
information. Another concern is the end-to-end connection-oriented approach of 
RSVP [4] which is foreign to IP networks. The above issues result in an architecture 
that is complex to implement and deploy. This might be the single most important 
reason why the new Differentiated Services initiative has started. 

The Differentiated Services Architecture (Diffserv) attempts to address the above 
concerns by operating on the premise that it is beneficial to move complexity to the 
edge of the network and keep the core simple [12]. It intends to provide differing 
levels of service in the Internet without the need for per-flow state and signaling at 
each router. The framework includes the following elements: (i) a Traffic 
Conditioning element at the edge of the network that marks packets to receive 
certain levels of service at the backbone, polices packets and performs traffic 
shaping to ensure that packets entering the backbone conform to network policies; 
(ii) core routers that treat packets differently depending on the packet marking 
completed by the edge device; (iii) an allocation/policy mechanism that translates 
into end-to-end QoS levels of service seen by the end-users [12]. 

1.3 Our Work 

This paper discusses a scheme for providing end-to-end QoS which is driven by 
application requirements. The approach is based on Intserv but does not utilize 
RSVP, thus addressing the scalability concerns [6] expressed about the Intserv 
model. Our approach consists of the following: The first element is a device called 
the Traffic Conditioner which manages the bandwidth at router junction points. The 
second element is a connectionless mechanism for ensuring consistent end-to-end 
delivery of QoS based on application requirements. 

The Traffic Conditioner is based on the reference implementation framework 
described in RFC 1633. It contains three of the elements required to manage 
bandwidth on a particular router junction point: (i) Classifier, (ii) Admission 
Controller, (iii) Scheduler. However, instead of RSVP, it utilizes a scheme proposed 
in [5] to automatically discover QoS requirements for traffic flows and service them 
accordingly. 

In the RSVP model, host machines would initiate connection-setup end-to-end 
before engaging in data transfer. This connection-setup is used to communicate 
application QoS requirements and determine if the necessary network resources are 
available before starting data transfer. This type of pre-negotiation is not intrinsic to 
IP networks. 

The Traffic Conditioner automatically and dynamically meets application 
requirements without the need for pre-negotiation. Instead, it performs on-the-fly 
traffic characterization to put network traffic into different classes, which are then 
serviced by a scheduler in a manner that reflects the application requirements for 
that class of traffic. 

Having discovered the requirements for a particular flow, and classified the packet 
accordingly, the Traffic Conditioner marks the packet with its class. 

The second element required is a mechanism to ensure consistent allocation of 
bandwidth across all routers to provide end-to-end QoS. This is an area of ongoing 
research and is discussed further in Section 3. The connectionless approach 
utilized is similar to the scheme used in the Differentiated Services Architecture 
outlined earlier. 

The paper is organized as follows: Section 2 describes the functional detail of the 
Traffic Conditioner including the classification, admission control and scheduling 
schemes utilized. Section 3 outlines the necessity of a consistent bandwidth 
allocation mechanism. Section 4 presents the experimental results gained from 
implementing the prototype Traffic Conditioner. Section 5 contains the conclusion 
and discusses future plans. 

2 TRAFFIC CONDITIONER: FUNCTIONAL DETAIL 

With the advent of new applications, the traffic pattern in IP networks (Internet) has 
changed over the years. The present traffic in IP networks can be broadly 
divided into three categories: (i) Real-Time, (ii) Interactive and (iii) Bulk. 
Real-time voice and video require an uninterrupted data stream which keeps the 
maximum delay variation within a limit. Video conferencing requires the total delay 
to be within a limit so that the human interaction is not affected. The traffic 
generated by Telnet, X-windows and web browsing is also interactive in nature and 
requires a good response time. Another category is bulk transfer, like file transfer 
and NFS (Network File System) backup, which does not have any stringent delay 
variation requirement. At the same time, this bulk traffic should not cause 
congestion for other classes of traffic. 

Figure 1 shows the Traffic Conditioner functional model. The Traffic Conditioner 
performs bandwidth management on an individual packet stream (i.e., the packets 
flowing between two applications in client and server). Each packet belongs to a 
flow. Flows are uniquely identified by source and destination IP addresses, transport 
level port pairs and protocol type. Flows are important, since the classification is 
based on the characteristics of a stream of packets between a pair of users. 

Any packet arriving at the input of the Traffic Conditioner is associated with a flow. 
All the flows are maintained in a flow-list. If the arrived packet does not belong to 
any existing flow in the flow-list, a new flow entry is attached to the flow-list. The 
packet is queued at the scheduling class queue for the flow’s class. The scheduler 
outputs the packets based on the class and the scheduling strategy. The real-time 
data path (as shown in Figure 1) has two major functions: (i) Identify the flow for 
the input packet and (ii) schedule the packet to the output. A flow entry is removed 
from the flow-list if no packet arrives for the flow within a fixed time (e.g., two 
seconds). 
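
The flow-list bookkeeping described above can be sketched as a dictionary keyed by the five fields that identify a flow. The names `FlowKey`, `FlowEntry` and `FlowList` are hypothetical and chosen here for illustration; the two-second idle timeout and the default class for not-yet-classified flows follow the text.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The five fields that uniquely identify a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass
class FlowEntry:
    traffic_class: str = "default"  # used until the classifier has run
    last_seen: float = 0.0
    packets: int = 0


class FlowList:
    """Hypothetical flow-list with idle-timeout removal."""

    def __init__(self, idle_timeout=2.0):
        self.idle_timeout = idle_timeout
        self.flows = {}

    def on_packet(self, key, now):
        """Associate an arriving packet with its flow, creating it if new."""
        entry = self.flows.get(key)
        if entry is None:
            entry = FlowEntry()
            self.flows[key] = entry
        entry.last_seen = now
        entry.packets += 1
        return entry

    def expire(self, now):
        """Remove flows idle for longer than the timeout; return the count."""
        stale = [key for key, entry in self.flows.items()
                 if now - entry.last_seen > self.idle_timeout]
        for key in stale:
            del self.flows[key]
        return len(stale)
```

The caller supplies `now` explicitly (for example from `time.monotonic()`), which keeps the sketch deterministic and easy to test.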

The background functions (as shown in Figure 1) are to (i) classify the flows, (ii) 
perform admission control and (iii) perform bandwidth estimation of the different 
classes of traffic. The classification of each flow is performed on the fly based on 
traffic characteristics (e.g., bit rate, packets per second etc.). The classification is 
performed periodically on all the flows in the flow-list. Packets that arrive at the 
input before their flow has been classified are scheduled with the default class. 

The bandwidth estimator updates the bandwidth usage of each class periodically. 
This information is needed by the admission controller and by the classifier at 
classification time. An embedded HTTP server is provided for monitoring and 
setting parameters by the administrator. 

FIGURE 1. Traffic Conditioner Functional Blocks 



2.1 Flow Classification and Admission Control 

The paper [11] proposes a hierarchical link sharing mechanism to address the 
requirements for real-time traffic in the presence of other classes of traffic. The 
classes in the hierarchical link-sharing structure can be multiple agencies, multiple 
applications, multiple protocols etc., and the mechanism used for link sharing is 
called class based queueing (CBQ). The experimental results reported in [9] 
demonstrate CBQ-based link sharing between CBR traffic and TCP traffic of two 
different priorities. They also show the delay behavior of the traffic on the shared 
link. 

The Traffic Conditioner classification strategy is based on the scheme proposed by 
[5]. The classification of both TCP and UDP flows is performed on the basis 
of the different treatments required for different traffic types, i.e., the application 
requirements. Thus, different applications with similar service requirements will 
fall under the same class. These classes can be different nodes in the hierarchical 
link-sharing structure proposed in [11]. 

The idea behind classifying TCP flows is to separate traffic requiring fast response 
from the delay insensitive bulk transfer. The UDP flows are classified to 
differentiate between traffic requiring (i) low latency and low bandwidth, (ii) low 
latency and high bandwidth and (iii) delay insensitive bulk transfer. The 
classification of a traffic flow is performed on the basis of traffic characteristics 
rather than by identifying the well known ports (e.g., 80 for web), although the 
ports can be used to assist the classification. 

The Traffic Conditioner divides traffic into the following classes: 

Interactive: The TCP flows with short packets that require a short round trip time 
are captured in this class. Applications like Telnet, web browsing and interactive 
X-windows fall in this category. The objective here is to protect a portion of the 
total bandwidth to ensure a reasonable response time. A short packet is defined as 
a packet of 128 bytes or less. 

All the TCP flows are classified as Interactive at the beginning. If the number of 
continuous long packets exceeds a threshold (e.g., 200) without a string of two or 
more short packets, the flow is moved to a bulk transfer class. If the class bandwidth 
is available, the flow is classified as Bulk Transfer with Reserved Bandwidth. 
Otherwise, the flow class is Bulk Transfer with Best Effort. 

Bulk Transfer with Reserved Bandwidth: The TCP flows with continuous long 
packets are captured in this class. Applications like large FTP and Web image 
transfers fall in this class. The bulk flows are ensured a certain portion of the 
total bandwidth; at the same time, bulk traffic should not encroach on the bandwidth 
allocated to other classes of traffic. 

If there are two or more continuous short packets in the stream of long packets, 
the flow is reverted back to Interactive class. 

Bulk Transfer with Best Effort: The TCP flows with continuous long packets for 
which no class bandwidth is available in the reserved category fall in this class. This 
traffic is scheduled on a best effort basis. 

If bandwidth becomes available, the flow makes a transition to Bulk Transfer with 
Reserved Bandwidth class. If there are two or more continuous short packets in the 
stream of long packets, the flow is reverted back to Interactive class. 

Low Latency: The UDP flows with low packet rates are captured in this class. The 
applications like real audio, interactive voice, NFS requests and short replies, DNS 
(Domain Name Server) transactions fall in this category. The objective of this class 
of traffic is to treat it with high priority so that the latency remains low. 

All the new UDP flows are classified as Low Latency at the beginning. The flow is 
moved to UDP Best Effort if the packet rate exceeds packet rate threshold or no 
bandwidth is available at Low Latency class. 

Best Effort: The UDP flows with a high packet rate that are not classified as Real 
Time fall in this class. NFS file backup is one application that belongs to this 
category. The traffic is transferred on a best effort basis. 

The flows matching the Real Time template are moved to the Real Time class if 
that class has available remaining bandwidth. Flows can also have a path back to 
Low Latency class. 

Real Time: The UDP flows like streaming video and NFS based video are in this 
class. The traffic is handled with high priority so that the latency remains at the 
minimum. The reason for distinguishing Low Latency and Real Time is to preserve 
the bandwidth for the Low Latency class of traffic. 

The streaming real time traffic arrives at a constant rate (with a slight variation 
due to network delay) at the receiver. The distribution of packet interarrival times is 
unimodal. In contrast, NFS based file transfer sends a fixed number of packets 
before it waits for an acknowledgment from the application layer. This means that 
the distribution of packet interarrival times for NFS based applications is bimodal, 
which distinguishes it from the unimodal distribution of the Real Time class of 
traffic. Also, if the packet rate of a flow in the Real Time class exceeds a certain 
threshold (which makes it unlikely to be video), the flow is reverted to the UDP Best 
Effort class. If the flow is idle for more than a second, its class is changed to Low 
Latency. 
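
The TCP class transitions described earlier in this section (Interactive, Bulk Transfer with Reserved Bandwidth, Bulk Transfer with Best Effort) can be sketched as a small per-flow state machine. This is one possible reading of the rules, not the authors' code: the class and constant names are hypothetical, and the sketch assumes that a single short packet neither resets the long-packet count nor triggers a return to Interactive, while two consecutive short packets do both.

```python
SHORT_PACKET_BYTES = 128   # a short packet is 128 bytes or less
LONG_RUN_THRESHOLD = 200   # example threshold from the text


class TcpFlowClassifier:
    """Per-flow TCP classification state (illustrative sketch)."""

    def __init__(self):
        self.traffic_class = "interactive"  # every TCP flow starts here
        self.long_run = 0    # long packets seen since the last reset
        self.short_run = 0   # consecutive short packets

    def on_packet(self, size, reserved_bw_available):
        if size <= SHORT_PACKET_BYTES:
            self.short_run += 1
            if self.short_run >= 2:
                # A string of two or more short packets: back to Interactive.
                self.long_run = 0
                self.traffic_class = "interactive"
        else:
            self.short_run = 0
            self.long_run += 1

        if self.long_run > LONG_RUN_THRESHOLD:
            if self.traffic_class == "interactive":
                self.traffic_class = ("bulk_reserved" if reserved_bw_available
                                      else "bulk_best_effort")
            elif (self.traffic_class == "bulk_best_effort"
                  and reserved_bw_available):
                # Bandwidth became available: promote to the reserved class.
                self.traffic_class = "bulk_reserved"
        return self.traffic_class
```

The `reserved_bw_available` flag stands in for the answer the bandwidth estimator and admission logic would give at classification time.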

As mentioned in [10], to provide bounded delay service, networks must use 
admission control to regulate the load. In this work, the measurement-based 
admission controller enforces a limit on the flows which require high bandwidth 
and high priority scheduling. Admission control is enforced on the Real Time flows 
based on the measurement of link usage. The Bandwidth Estimator periodically 
updates the bandwidth usage of the Real Time class. Any new Real Time flow is 
admitted (i.e., allowed to be serviced) only if the new bandwidth requirement of the 
class is within the administrator-specified limit. Otherwise, the flow is marked as 
reject. Packets of flows belonging to the reject class are dropped by the Traffic 
Conditioner. An ICMP (Internet Control Message Protocol) host unreachable 
message is sent back to the source host to stop the flow. 
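
The admission test for the Real Time class reduces to a single comparison against the administrator's limit. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes that the rate of the candidate flow is known (for example from the classifier's own measurements) and that the measured class usage is refreshed separately by the bandwidth estimator.

```python
class RealTimeAdmissionController:
    """Measurement-based admission check for the Real Time class (sketch)."""

    def __init__(self, limit_bps):
        self.limit_bps = limit_bps   # administrator-specified class limit
        self.measured_bps = 0.0      # latest usage from the estimator

    def update_measurement(self, measured_bps):
        """Called periodically by the bandwidth estimator."""
        self.measured_bps = measured_bps

    def admit(self, flow_rate_bps):
        """Return the class for a candidate Real Time flow."""
        if self.measured_bps + flow_rate_bps <= self.limit_bps:
            return "real_time"
        return "reject"   # packets of rejected flows are dropped
```

Because usage is only refreshed at each estimator update, several flows arriving between updates could all pass the test; the text does not say how the prototype handles that case.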

2.2 Scheduling 

A key component in bandwidth management of a link is a scheduling mechanism. 
In [11], the usefulness of a priority scheduling mechanism for link sharing is 
studied. The scheduling of the classified flows in the Traffic Conditioner is 
performed with two priorities: high and low. High priority traffic is scheduled 
without any delay at the scheduling queue and is limited by the admission control 
mechanism. Low priority traffic is scheduled according to a set of rules based on the 
allocation of bandwidth and criteria for sharing bandwidth with other classes of 
traffic on the link. Traffic in the Real Time and Low Latency classes is handled with 
high priority. Traffic in the Interactive, Bulk Transfer with Reserved Bandwidth and 
Best Effort classes is scheduled with low priority. The ideas are to protect a portion 
of bandwidth for the Interactive class of traffic to guarantee a low round trip time, 
and to limit the delay insensitive Bulk Transfer traffic to a specified limit so that it 
does not hog bandwidth from other classes. The rule-based approach for scheduling 
the low priority classes achieves a goal similar to that of weighted round robin [11], 
with weights proportional to the combination of allocated bandwidth per class and 
delay sensitivity of the class. 

A scheduling window of T seconds (1 second for this implementation) is chosen. The 
bandwidth allocation for each class during T seconds is proportional to the 
administrator-specified bandwidth values. The window is divided into sub-windows 
of t milliseconds each. The scheduler wakes up every t milliseconds and schedules 
the packets that arrived at the class queues during the preceding sub-window 
according to the following rules: 

1. All the high priority packets (UDP Low Latency and Real Time) are transmitted 
until there are no packets remaining in these queues. The queues are served in a 
round robin fashion. 

2. The packets of the Interactive and Reserved Bandwidth classes are transmitted
only if there is no remaining high-priority packet at the class queues. These
two queues are served in a round robin fashion.

3. The packets from the Reserved Bandwidth class queue are transmitted only if
the allocated bandwidth limit for the class during the scheduling window of T
seconds is not exceeded.

4. The packets from the Interactive class queue are transmitted even if the
allocated bandwidth for the class is exceeded (in a scheduling window of T
seconds), provided that there is no packet waiting at the Best Effort and
Reserved Bandwidth classes and bandwidth is available in those two classes.

5. Best Effort packets are transmitted if there is no packet in any other
class of traffic and bandwidth is available for this class. These packets are
also transmitted if the allocated Best Effort bandwidth is exceeded (in a
scheduling window of T seconds), provided that there are no remaining packets at
the Reserved Bandwidth class and bandwidth is available in that class.

6. Packets from the Reserved Bandwidth class and the Best Effort class are not
allowed to borrow bandwidth from the Interactive class of traffic. Thus, the
Interactive bandwidth is preserved.
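
The six rules can be read as a small scheduling routine. The Python sketch
below is one possible interpretation, not the VxWorks implementation described
in this paper. The class names, the byte-based budgets, and the functions
`charge` and `schedule_tick` are assumptions made for illustration. The sketch
also simplifies in three ways: high priority traffic is not charged against a
budget (it is limited by admission control instead), the link capacity of a
single sub-window is not modelled, and Best Effort is served only after the
round robin over Interactive and Reserved Bandwidth has stopped.

```python
from collections import deque

HIGH = ("real_time", "low_latency")
INTERACTIVE, RESERVED, BEST_EFFORT = "interactive", "reserved", "best_effort"


def spare(cls, budget, used):
    """Bytes of the class allocation still unused in the current window T."""
    return budget[cls] - used[cls]


def charge(cls, size, queues, budget, used, lenders=()):
    """Try to pay for one packet of `size` bytes on behalf of class `cls`.

    The class's own allocation is used first. Failing that, the packet may
    borrow from a lender, but only if no packet is waiting at any lender and
    a lender still has spare allocation. Returns True if the packet may go.
    """
    if spare(cls, budget, used) >= size:
        used[cls] += size
        return True
    if lenders and all(not queues[l] for l in lenders):
        for l in lenders:
            if spare(l, budget, used) >= size:
                used[l] += size
                return True
    return False


def schedule_tick(queues, budget, used):
    """Run one wake-up of the scheduler; returns the (class, size) pairs sent."""
    sent = []
    # Rule 1: drain the high priority queues in round robin order.
    while any(queues[c] for c in HIGH):
        for c in HIGH:
            if queues[c]:
                sent.append((c, queues[c].popleft()))
    # Rules 2-4 and 6: round robin over Interactive and Reserved Bandwidth.
    # Interactive may borrow from the other two classes; Reserved may not borrow.
    plan = ((INTERACTIVE, (BEST_EFFORT, RESERVED)), (RESERVED, ()))
    progress = True
    while progress:
        progress = False
        for c, lenders in plan:
            if queues[c] and charge(c, queues[c][0], queues, budget, used, lenders):
                sent.append((c, queues[c].popleft()))
                progress = True
    # Rule 5: Best Effort goes last and may borrow only from Reserved Bandwidth.
    q = queues[BEST_EFFORT]
    while q and charge(BEST_EFFORT, q[0], queues, budget, used, (RESERVED,)):
        sent.append((BEST_EFFORT, q.popleft()))
    return sent
```

Because the Interactive class never appears in a `lenders` tuple, rule 6 holds
by construction. The `used` counters would be reset to zero at the start of
each window of T seconds.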

The scheduling queue lengths at the Traffic Conditioner differ between the
classes of traffic. The queue length never grows for high priority traffic. The
queue length does not grow for Interactive traffic either, since bandwidth is
preserved for this class and it is allowed to steal bandwidth from other
classes. The queue lengths of Bulk Transfer with Reserved Bandwidth and TCP Best
Effort grow with traffic, but TCP adjusts to keep the queue length minimal. The
queue length of UDP Best Effort can grow if the traffic exceeds its allowed
limit. Thus a congestion control mechanism similar to drop tail is implemented
to restrict the excessive traffic for this class.
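
Drop tail itself is simple to state in code. The following Python sketch shows
the generic technique only; the paper says the mechanism is "similar to" drop
tail without giving details, so the class name `DropTailQueue` and the
packet-count limit are assumptions made for illustration.

```python
from collections import deque


class DropTailQueue:
    """FIFO queue that discards arriving packets once it is full."""

    def __init__(self, limit):
        self.limit = limit        # maximum number of queued packets (assumed unit)
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Queue the packet, or drop it if the queue is already at its limit."""
        if len(self.packets) >= self.limit:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        """Return the oldest queued packet, or None if the queue is empty."""
        return self.packets.popleft() if self.packets else None
```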

3 CONNECTIONLESS APPROACH FOR END-TO-END 
BANDWIDTH ALLOCATION 

The Traffic Conditioner is one element in the connectionless approach to QoS in IP 
networks. Ongoing investigation and experimentation is being pursued to study the 
second component, i.e. providing a mechanism to emulate end-to-end reservation 
setup. The issue of providing QoS similar to Intserv’s Guaranteed Service Model 
[3] without using an RSVP-like resource reservation mechanism, remains an open 
research area. It is not clear whether or not the Differentiated Services initiative will 
be able to provide this service. 

However, we believe that it should be possible to achieve the Controlled Load
Service model [2] using the approach outlined in this paper. In order to do
this, a mechanism is required to treat flows in a consistent manner in all
routers along the path from source to destination. To avoid the pitfalls of
RSVP, this should be achieved without per-flow state saving at each router.

This can be achieved in a well-engineered network with packet marking by the 
Traffic Conditioner. The administrator statically preallocates bandwidth for each of 
the defined classes in a consistent manner across the network. This can easily be 
achieved in an intranet environment. However, in the internet, the solution evolving 
out of the Diffserv initiative should be applicable. Proposals have been made that 
discuss dynamic allocation of bandwidth in a connectionless environment [7]. 

4 EXPERIMENTAL RESULTS 

The Traffic Conditioner (TC) was implemented on a Pentium 200MHz PC running
VxWorks as the RTOS (Real Time Operating System). The current implementation
utilizes a 4-port OSICOM PCI Ethernet card. One port is used for network
management. The other two ports connect to the link that the Traffic Conditioner
is conditioning. The TC also contains Mombasa, an embedded web server that is
used to facilitate device network management. With the inclusion of Mombasa,
network administrators are able to manage the Traffic Conditioner using any
standard HTTP web browser, e.g. Netscape or Internet Explorer. Currently,
Mombasa is used for configuration and statistics monitoring.

Although this study would have benefitted from a hardware implementation, the
software implementation of the TC was able to support traffic levels on a
10Base-T Ethernet LAN.




FIGURE 2. Experimental Setup 

For its operation, the TC sets the Ethernet cards in promiscuous and NSAI modes.
Promiscuous mode allows it to take in all packets on the line. NSAI mode prevents
the Ethernet card from stamping its MAC address as the source. This allows the TC
to operate transparently on the link.

The implementation model used by the TC closely reflects that depicted in
Figure 1, with the Flow Management and Scheduling processes carried out as
foreground processes and the classification carried out as a background process.

As mentioned previously, RFC 1633 [1] provides a reference implementation
model for a QoS-enabled router. The intention of this set of experiments was to
study the QoS model elements referred to in that RFC. To do this, the TC was
developed as a stand-alone device intended to perform bandwidth management over
a single link. The rest of this section describes the results obtained when the
Traffic Conditioner was deployed on the network in the presence of various types
and rates of traffic.

FIGURE 3. Interarrival Times - without TC; Background Load - 0.5Mbps

The experimental setup is depicted in Figure 2. As the figure shows, the TC was
deployed in a live network. To carry out the tests, varying levels of network
traffic needed to be created. A deterministic traffic generator was used to
generate traffic with varying packet lengths and sending rates for both UDP and
TCP. As Figure 2 shows, the setup consisted of a Silicon Graphics video server,
a File Server, a Windows 95 computer with streaming video client software,
Linux 2 which served as the traffic generating computer, Linux 1 which served as
the sink computer for the traffic generator, and Linux 3 which was used as the
Web server.

FIGURE 4. Interarrival Times - without TC; Background Load - 6.2 Mbps

A key outcome of the experiments below was the verification of the classification 
scheme proposed in [4]. Various types and rates of traffic were generated and it was 
observed that all fit into expected classes. The experiment was performed with the 
following applications: ftp, telnet, streaming video, real audio, web browsing (small 
and large files), DNS and NFS based file transfer. 

Experiment 1 - Behavior of Video Traffic in presence of network traffic 

The goal of the first experiment was to explore how real-time traffic such as
video behaved in the presence of network traffic. It was desirable to observe
its behavior in scenarios with and without the TC. Video sessions were started
from the Windows 95 computer. This resulted in around 1.4 Mbps of streaming
video being transferred from the SGI video server to the Windows 95 video
client. Background network traffic was generated with Linux 2 as the source and
Linux 1 as the sink.

Three separate tests were carried out. Packet interarrival times were measured for 
the video stream in the following test cases: 

1. Without Traffic Conditioner; Low network traffic rates 

2. Without Traffic Conditioner; High network traffic rates 

3. With Traffic Conditioner; High network traffic rates

FIGURE 5. Interarrival Times with TC; Background Load - 6.2Mbps



In all cases, the experiment was run for 3 minutes and around 25,000 samples
were obtained. Measurements were obtained on the TC interface connected to
Ethernet Hub1. The results of the three tests can be seen in Figures 3, 4 and 5.
Each Figure plots a histogram of the packet interarrival times for the video
traffic.

Figure 3 shows the packet interarrival distribution for a network with very low
traffic levels. The distribution is a clear reflection of video traffic
characteristics. Two clear peaks can be seen on the distribution. There was a
third peak at around 80 ms, but that set of data was filtered out for all three
Figures as it is of little comparative value. The peak around 1.25 ms accounted
for almost 84% of the 25,000 data samples. The other peaks at 0.1 and 80 ms
appeared equal in height and width. This reflects the nature of traffic from
this particular video stream, which we verified sends packets in 12-packet
cycles: 10 packets with a delay of 1.25 ms, 1 packet with a delay of 0.1 ms and
1 packet with a delay of 80 ms.
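
The reported cycle can be cross-checked with a little arithmetic. The Python
sketch below uses only the gaps quoted above and the video rate of around
1.4 Mbps. Spreading that rate evenly over all packets is an assumption made
here for illustration, not a measurement from the experiment. Under these
figures one cycle lasts about 92.6 ms, the main peak holds 10 of every 12
samples (about 83%, close to the reported share), and the stream carries
roughly 130 packets per second, or about 23,000 packets in 3 minutes, which is
the same order as the reported sample count.

```python
# Gaps within one 12-packet cycle, in milliseconds, as quoted in the text.
gaps_ms = [1.25] * 10 + [0.1, 80.0]

cycle_ms = sum(gaps_ms)                           # duration of one cycle
share_main = gaps_ms.count(1.25) / len(gaps_ms)   # fraction of gaps in the main peak
packets_per_s = len(gaps_ms) / (cycle_ms / 1000.0)
samples_in_3_min = packets_per_s * 180

# Assumption: the ~1.4 Mbps video rate is spread evenly over all packets.
bytes_per_packet = 1.4e6 / 8 / packets_per_s

print(f"cycle: {cycle_ms:.1f} ms, main peak share: {share_main:.1%}")
print(f"packets/s: {packets_per_s:.1f}, samples in 3 min: {samples_in_3_min:.0f}")
print(f"implied packet size: {bytes_per_packet:.0f} bytes")
```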

The next test case involved the same video session but also introduced a
background network traffic load of 6.2 Mbps from the Linux 2 to the Linux 1
computer. The measurements from this test can be seen in Figure 4. As can be
seen from Figure 4, the height of the main peak reduced greatly and its width
expanded considerably. Quantitatively, the width expanded from 0.23 ms to
0.45 ms, almost doubling in size. This kind of effect on video traffic is not
desirable, as it affects the quality of the images received by the client. In
addition, four small peaks appear due to the contention of background and video
traffic at the physical layer (Ethernet).

The final test case in this experiment involved the deployment of the TC (as
shown in Figure 2) and the launching of a video stream in the presence of the
same 6.2 Mbps background network traffic. The results of the measurements are
displayed in Figure 5. As can be seen, the main peak is much higher than that of
test case 2 (Figure 4). Also, the width of the peak has reduced from 0.45 ms to
0.3 ms, a reduction of 33%. Secondly, the two peaks on the right side of the
main peak have been consolidated into a single small peak with minimal width.
The TC is unable to remove this peak because it does not have control over the
delays that occur as a result of high collision rates between the video and
background traffic at the physical layer. On a final note, the small peaks to
the left of the main peak have virtually been eliminated. The results of Figure
5 are quite encouraging, as they reveal the TC’s ability to improve the quality
of a video stream in the presence of heavy network traffic.

Figure 6 shows the comparison of the interarrival standard deviations for
different levels of network load with and without the TC deployed. As can be
seen from Figure 6, the two lines diverge with increasing background load. As
expected, the TC appears to be showing more value as the background traffic
load increases.
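
The quantities plotted in Figures 3 to 6 can be derived from packet arrival
timestamps. The Python sketch below shows one way to do this; the function
names, the 0.25 ms bin width, and the use of the population standard deviation
are assumptions made for illustration, since the paper does not describe its
measurement tooling.

```python
from collections import Counter
from statistics import pstdev


def interarrival_ms(timestamps_s):
    """Gaps between consecutive arrival timestamps, converted to milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(timestamps_s, timestamps_s[1:])]


def histogram(gaps_ms, bin_ms=0.25):
    """Count gaps per bin; each key is the lower edge of a bin in ms."""
    counts = Counter(int(g // bin_ms) for g in gaps_ms)
    return {round(i * bin_ms, 6): n for i, n in sorted(counts.items())}


def interarrival_stddev_ms(timestamps_s):
    """Standard deviation of the interarrival times, as plotted against load."""
    return pstdev(interarrival_ms(timestamps_s))
```

A narrower main peak in the histogram corresponds to a smaller standard
deviation, which is why the two views tell the same story about jitter.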

Experiment 2 - Behavior of TCP Interactive in the presence of network traffic 

FTP sessions were used to set up TCP connections between Linux 1 and the file
server (see Figure 2). Without the Traffic Conditioner, it was seen that the
transfer of large files consumed an average bandwidth of 488 Kbps.

The same test was repeated with the Traffic Conditioner deployed. The
administrator-specified rate for the TCP Guaranteed class of traffic was set to
300 Kbps. The file transfer continued to attempt transferring files as fast as
it could. For the first few seconds, the data transfer rate for the flow
remained at 488 Kbps.

FIGURE 6. Standard Deviation of Packet Interarrival Times for Video

However, the mechanism of the Traffic Conditioner based on queueing delays
worked to effect end-to-end TCP rate control. As a result, the application
adjusted its sending rate to the amount that the Traffic Conditioner would
allow through the pipe. The administrator-specified rate was 300 Kbps, and
measurements at the Traffic Conditioner output showed that the flow was clamped
down to 292 Kbps, an accuracy of 97%. The maximum queue length for this class
was observed to be 11.

The experiment was also performed with two ftp sessions attempting to transfer 
large files. It was observed that the two flows shared the available bandwidth of 300 
Kbps for this class. One flow was transferred at 130 Kbps and the other at 166 
Kbps. The maximum queue length was observed to be 21 for this class of traffic. 

Experiment 3 - Behavior of TCP Interactive in the presence of network traffic

The goal of this experiment was to see whether or not the round trip time of the
TCP Interactive class of traffic suffered in the presence of background traffic.
A software tool (HTTP requester) was used to generate HTTP requests to retrieve
a file from a web server. The web server ran on the Linux 3 machine (Figure 2).
Video was played in the background to generate high priority traffic. In
addition, traffic of the UDP Best Effort class was generated as background
traffic.



FIGURE 7. Average Round Trip Time for TCP Interactive Traffic (vertical axis:
Average Round Trip Time / ms)

The round trip time for each HTTP request was measured as a quantitative
indicator of user-experienced delay when using a web browser. The measurement
was performed under varying network load. Each measurement is the average round
trip time of 100 HTTP requests. Figure 7 shows the round trip times as a
function of network load. It is observed that the RTT is lower when the TC is
deployed. This is due to the fact that a certain bandwidth is preserved for the
Interactive class of traffic. The impact of the TC is more prominent at high
background load. However, it is debatable whether a reduction of 2 milliseconds
in RTT has any impact on the user perception of web browsing.

5 CONCLUSION AND FUTURE WORK

The key contribution of this work is to show that it is possible to automatically dis- 
cover the application QoS requirements without RSVP-like pre-negotiation. Fur- 
ther, experimental results have shown that the on-the-fly classification scheme 
proposed in [5] can successfully classify the current traffic in IP networks. This 
implementation has also demonstrated that the combination of classifier, scheduler 
and admission controller can effectively condition and manage traffic on a link. It 
has been shown that the different classes of traffic can be serviced in a “required” 
manner in the presence of background load. 

The Traffic Conditioner is one building block in providing QoS capability in IP 
networks. Investigation and experimentation is needed for other building blocks. 
The other key building block is a mechanism for providing end-to-end allocation of 
service without a per-flow connection-setup protocol. This is an area of on-going 
work. 

One open issue is to understand if the flow-based connectionless approach out- 
lined in this paper is scalable. Although this approach removes the scalability issue 
associated with per-flow state saving at routers, there is still an issue with the num- 
ber of flows that need to be handled by Traffic Conditioning elements. A second 
area of exploratory work is to investigate the suitability of the Traffic Conditioner 
as an edge device [8] in a Differentiated Services network [12].

Abstract

This paper describes an implementation of an application level gateway for 
connecting adaptive applications using hierarchical encoded video across ATM 
and IP networks. The gateway participates in RSVP and ATM signaling. The 
signaling information, as well as local information about processing load, is 
used by the receivers to decide the number of layers to join, and by the sources 
to fine tune the bitrates of the layers to the available capacities. Our approach 
pushes layering related complexity to the edge of the network, and allows us 
to use standard ATM UNI and RSVP signaling. The gateway participates in 
a modified session directory (SDR) protocol, to learn the addressing informa- 
tion necessary to perform signaling translation, and to enable layered sessions 
to be visible across the IP/ATM boundary. By considering all aspects of the
problem, especially session directory issues and dynamic bandwidth selection 
for the layered hierarchy, we have implemented a system that is much more 
complete than any of the previous prototypes of layered multicasting. This pa- 
per describes the implementation experience and presents some measurements 
of the performance of the gateway. 

Keywords 

Layered multicast, Internet Protocol (IP), Asynchronous Transfer Mode (ATM), 
Gateway, IP/ATM interoperation.

1 INTRODUCTION

In this paper, we consider multicast video applications that use hierarchical 
encoding to handle network heterogeneity, using signaling protocols to probe 
the network for available capacity. Multicasting is a powerful network abstrac- 
tion to support point-to-multipoint and multipoint-to-multipoint communica- 
tion. Every packet sent to a multicast group is delivered by the network to 
all the receivers of the group. Both Internet Protocol (IP) and Asynchronous 
Transfer Mode (ATM) networks support multicasting, since it is more efficient 
than multiple point-to-point connections for the same purpose. 

Network environments are heterogeneous by nature. This heterogeneity 
comes from many sources such as link capacity, end stations processing power 
and display resolution, network protocols and level of Quality of Service (QoS) 
support. Heterogeneity is especially problematic for multicast applications, 
since the receivers may not agree on the data rate that they want to receive, 
or the protocol to use to signal QoS requirements to the network. 

Shacham (Shacham 1992) has proposed multicast transmission of layered
video as a solution to data rate heterogeneity. The data is encoded into a 
low resolution base stream and a series of enhancement streams. This allows 
different receivers to receive data from the same source at different rates, 
simply by subscribing to different numbers of multicast streams, identified by 
multicast address in IP or multicast virtual circuit (VC) in ATM. 
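
The subscription idea can be illustrated with a few lines of Python. This is a
minimal sketch under stated assumptions: the function name `layers_to_join` and
the example layer rates are hypothetical, and a receiver is assumed to know its
usable capacity in advance, which the adaptive scheme described later in this
paper replaces with signaling feedback.

```python
def layers_to_join(layer_rates_kbps, capacity_kbps):
    """Number of layers, base layer first, whose cumulative rate fits the capacity."""
    total = 0
    count = 0
    for rate in layer_rates_kbps:
        if total + rate > capacity_kbps:
            break
        total += rate
        count += 1
    return count


# Hypothetical hierarchy: one base stream followed by three enhancement streams.
rates = [64, 128, 256, 512]
for capacity in (50, 200, 2000):
    print(capacity, "kbps ->", layers_to_join(rates, capacity), "layers")
```

Each receiver evaluates this independently, so receivers behind slow and fast
links can share the same source while receiving at different rates.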

The problem of protocol heterogeneity can be solved by mandating a uni- 
versal protocol, such as IP. However, this requires unnecessary translations, 
and prevents applications from taking advantage of the native signaling, when 
the communication is restricted to a single signaling domain. An alternative 
approach is to perform translation of signaling and QoS semantics at the 
network boundaries. This approach allows applications in different networks, 
such as IP and ATM, to communicate with each other, while continuing to 
take advantage of the native signaling or resource reservation protocols lo- 
cally. A possible future network scenario involving this approach is a video 
on demand system, with the sources being video studios directly connected 
to a high speed ATM backbone, some high end (perhaps HDTV) clients con- 
nected directly to the ATM network, and some lower end clients connected 
over a slower speed IP-based network. There is no need to involve IP software 
overheads in the data transmission path from the source to the HDTV clients, 
but at the same time IP provides access to a more heterogeneous and broader 
set of clients. This is the model we assume for the rest of the paper. 

The paper proceeds as follows: Section 2 presents related work and moti- 
vation. Section 3 provides some background on application and session layer 
issues. Section 4 describes the gateway implementation. Section 5 discusses 
the implementation, the testbed, and the performance of the gateway. We 
conclude the paper with a summary in Section 6.

2 RELATED WORK AND MOTIVATION

This paper describes an implementation of a gateway to allow adaptive lay- 
ered video applications to communicate across different protocol domains, 
specifically IP and ATM networks. Our work is related to prior work on
Heterogeneous Multicast (HMC) (Sudan et al. 1997) with an IP/ATM gateway
for layered video, but differs in certain key areas that we describe below. In
addition, the applications considered in HMC are nonadaptive. Our work is 
also similar in many respects to Receiver-driven Layered Multicast (RLM) 
(McCanne et al. 1996), which uses receiver adaptation based on packet loss 
to select the number of layers to receive, but is restricted to an all IP environ- 
ment. Thus, many of the new problems we face are related to the handling of 
adaptive applications in a multiprotocol signaling environment. 

Both of the above systems only allow the receiver to select from a static 
set of layers, based on network bandwidth availability. If the set of band- 
widths being transmitted is not well tuned to the set of bandwidths available, 
these systems perform poorly. We believe that it is unreasonable to require 
the user to know a priori the set of network and receiver capacities required, 
and configure the sources with the correct set of layer bitrates. We implement 
feedback mechanisms across the network, so that the bitrates transmitted on 
the layers from the sources are adapted to the set of network bandwidths 
and receiver capacities dynamically. This functionality is orthogonal to the 
functionality provided by SCUBA (Amir et al. 1997), where the information 
from different sources is dynamically mapped on to a static set of layers in 
response to receivers expressing interest in particular sources. Our adaptation 
model also allows us to push all layering related complexity to the application 
layer, using standard User to Network Interface (UNI) (UNI 3.0. 1993) sig- 
naling within the ATM network and ReSerVation Protocol (RSVP) (Zhang 
et al. 1993, Braden et al. 1996) signaling within the IP network, instead of 
modifying the protocols to handle layering as suggested in HMC. 

Previous work in layered multicast has neglected the problem of session ad- 
vertisement. The session directory information provided by SDR (Handley et 
al. 1995) allows a receiver to learn of the existence of a session, and provides 
sufficient information (such as multicast addresses and port numbers in IP) 
so that the receiver can join the session. In an IP/ATM layered environment,
the problems of how layered sessions are specified in session advertisement 
messages, how receivers in one domain learn about sources in the other do- 
main, or how to reconcile the multipoint-to-multipoint nature of IP multicast 
with the point-to-multipoint nature of ATM virtual circuits (VCs), have not 
been previously dealt with. For example, in Sudan’s work the gateway must 
be manually configured with a static mapping between layers, IP multicast 
addresses, ATM addresses, participant identifiers (independent of domain spe- 
cific addresses), and traffic descriptors for the layers (in the ATM to IP case). 

Our work addresses these problems in a more general way, by extending
the Session Directory (SDR) protocol to handle layered sessions, ATM
addresses, and the point-to-multipoint nature of ATM multicast. The gateway
participates in the session directory protocol in both domains and performs 
session advertisement translation. These issues are particularly important for 
our system, since the applications are adaptive and use the session directory 
protocol to advertise changes in the structure of layers transmitted by the 
sources, in response to feedback about network and receiver heterogeneity. 

The approach of RLM is targeted to an all IP environment, and does not 
depend on the existence of signaling protocols (e.g., RSVP). If RSVP signaling 
is available, receiver adaptation can be much more stable than in RLM, while 
at the same time responding very quickly to available capacity. In RLM, 
stability and response time must be traded off against each other, depending 
on complex interactions between the duration of congestion and the time the 
network takes to prune a connection after a receiver leaves. Since prune times 
of currently deployed multicast routing and group membership protocols in 
the Internet are long (Gupta et al. 1997), a method like RLM must necessarily 
have poor response times to be stable (McCanne et al. 1997). In addition, using 
RSVP and ATM UNI signaling allows us to provide QoS guarantees. 

Finally, Sudan et al. do not consider the case of connecting an ATM network 
to an IP network without RSVP support. We handle the case where RSVP
support is not available by using a loss-based adaptation mechanism similar
to RLM. We help to handle some of the stability and latency problems raised
by this approach, by performing priority service based on layer number at 
the gateway, to concentrate loss due to congestion on the highest layers. This 
provides a clearer signal to the receiver to adapt, while reducing the impact 
of the loss on the received video quality. 
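
One way to concentrate loss on the highest layers is a bounded buffer that
serves low layers first and sheds high layers first. The Python sketch below
illustrates that general technique only. The class `LayerPriorityQueue`, its
packet-count capacity, and the exact drop and service order are assumptions
made for illustration, not the gateway's actual policy.

```python
import itertools


class LayerPriorityQueue:
    """Bounded buffer: lowest layer is sent first, highest layer is dropped first."""

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of buffered packets
        self._items = []                  # entries are (layer, sequence, payload)
        self._seq = itertools.count()     # arrival order, used to break ties

    def enqueue(self, layer, payload):
        """Buffer a packet; returns the dropped (layer, payload) or None."""
        self._items.append((layer, next(self._seq), payload))
        if len(self._items) <= self.capacity:
            return None
        # Overflow: discard the newest packet of the highest layer present.
        victim = max(self._items, key=lambda item: (item[0], item[1]))
        self._items.remove(victim)
        return (victim[0], victim[2])

    def dequeue(self):
        """Return the oldest packet of the lowest layer, or None if empty."""
        if not self._items:
            return None
        item = min(self._items, key=lambda item: (item[0], item[1]))
        self._items.remove(item)
        return (item[0], item[2])
```

With this ordering a congested gateway loses enhancement data before base
layer data, so the receiver sees loss on the layer it should drop.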

Our applications are based on Zakhor’s software codec (Taubman et al. 
1994), which is capable of encoding digital video into a very large number of 
fixed size sublayers. Previous work (Banerjea et al. 1997) showed how these 
can be combined dynamically to a smaller number of transport layers, allowing 
the bitrate to transport layer mapping to be adapted by the source in response 
to network feedback. We have added signaling support (RSVP over IP and 
UNI over ATM) and adaptation mechanisms to the above, to create adaptive 
layered network conferencing and video server applications. 

3 APPLICATION AND SESSION LAYER PROCEDURES 

In this section, we briefly summarize the feedback algorithms used by the 
receiver and the source to adapt to the network capacity and receiver load, 
and the signaling and session directory functionality required to handle layered 
applications on IP and ATM. Details can be found in (Yau et al. 1997). 

Our layered application uses three different adaptation mechanisms, which
work over different time scales and distances. The first is adaptation to receiver
load. Since our application performs decoding of the layered video in software,
the number of layers it can receive depends on the CPU load. The receiver
monitors the time to process a frame of video to determine the number of
layers it is able to process. This feedback is entirely local to the receiver’s
machine and involves the shortest time interval.

The application uses network signaling mechanisms (RSVP over IP and UNI 
over ATM) to determine network bandwidth availability. When the receiver 
load allows, the receiver probes the network with a reservation attempt to 
add the next layer. There is no danger of unstable behavior, since the layer 
is only added if its addition cannot cause congestion. This feedback involves 
the network and occurs over slightly longer intervals than CPU Monitoring. 
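
The first two mechanisms can be combined into a single adaptation step. The
Python sketch below is a simplified reading of the description above; the
function `adapt_layers`, the per-frame time budget, the rule of always keeping
the base layer, and the `reserve` callback (standing in for an RSVP or UNI
reservation attempt) are assumptions made for illustration.

```python
def adapt_layers(current, max_layers, frame_time_ms, frame_budget_ms, reserve):
    """Return the number of subscribed layers after one adaptation step.

    reserve(layer_index) is a hypothetical callback that attempts a
    reservation for the given layer and returns True on success.
    """
    # Local feedback: shed a layer if decoding cannot keep up with the video.
    if frame_time_ms > frame_budget_ms and current > 1:
        return current - 1
    # Network feedback: probe for the next layer only when the CPU keeps up.
    if frame_time_ms <= frame_budget_ms and current < max_layers:
        # Layers are indexed from 0, so `current` is the index of the next one.
        if reserve(current):
            return current + 1
    return current
```

A failed reservation leaves the subscription unchanged, which reflects the
stability argument above: a layer is added only after the network has accepted
the reservation for it.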

In the absence of signaling mechanisms (for example, IP networks without 
RSVP) the application uses a loss based feedback mechanism similar to RLM 
for network adaptation. However, this leaves us with the problems of stability 
and responsiveness mentioned before. We return to this issue in Section 4. 

The receivers provide feedback to the session originator about the link ca- 
pacities and receiver processing powers of the active receiver and network 
environment. The originator adjusts the bit rates being transmitted on each 
layer of the encoding hierarchy accordingly, and advertises the changed layer 
hierarchy information to the sources using the session directory protocol. The 
sources modify their transmitted hierarchies accordingly. This feedback in- 
volves both sources and receivers. For scalability reasons, it occurs over the 
longest time interval. Note, however, that this does not imply the system is 
not responsive, since the receiver adaptation (compared to RLM) is fast. The 
system adapts slowly to changes in the receiver set, and in comparison, all 
previous approaches did not adapt in this respect at all. 

In the IP network, a receiver changes its video quality by joining or leaving 
multicast groups using the Internet Group Management Protocol (IGMP). 
An IP multicast group is a many-to-many communication abstraction, so the 
receiver receives data from all sources transmitting to it. The receiver must 
also use RSVP to specify its QoS requirements, and make reservations for the 
layer. By waiting till the RSVP reservation request is successful before adding 
the next layer, the receiver can ensure that the network is not overloaded. 

In the ATM domain, a receiver changes its fidelity by joining or leaving
a multicast Virtual Circuit (VC). In ATM 3.x, the join request can only
be initiated by the source; hence, the receiver must send a request to the
source to be added to a specific layer. An ATM multicast VC is strictly
one-to-many, so the ATM receiver must know the Service Access Point (SAP)
address of each source in order to send the requests to join. The Session
Directory (SDR) protocol must be modified to carry this source specific
information. We accomplish this by adding an ATM_SRC message, which is
transmitted by each source on the ATM network, and carries the address
information.

We also extend the SDR message to carry layering information. This message
is periodically retransmitted by the session originator, and conveys
information about the number of layers, and the multicast address and bitrate
associated with each layer. Based on feedback received from the receivers,
the originator transmits a changed SDR message. On receiving the new SDR
message, sources adapt their transmitted streams; receivers can detect and
adapt to the change in the data stream.
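
The information carried by the two messages can be pictured as simple records.
The Python sketch below is illustrative only: the paper does not give the wire
format, so every class and field name here (including the `version` counter
and the example addresses) is a hypothetical choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayerInfo:
    multicast_address: str     # group address carrying this layer
    bitrate_kbps: int          # rate currently advertised for this layer


@dataclass
class LayeredSessionAnnouncement:
    session_name: str
    version: int               # assumed counter, bumped when the hierarchy changes
    layers: List[LayerInfo] = field(default_factory=list)


@dataclass
class AtmSrcMessage:
    session_name: str
    source_sap_address: str    # lets ATM receivers send join requests to the source


def hierarchy_changed(old, new):
    """True if the advertised layer structure differs between two announcements."""
    return old.layers != new.layers


before = LayeredSessionAnnouncement(
    "demo", 1, [LayerInfo("224.2.0.1", 64), LayerInfo("224.2.0.2", 128)])
after = LayeredSessionAnnouncement(
    "demo", 2, [LayerInfo("224.2.0.1", 64), LayerInfo("224.2.0.2", 96)])
print(hierarchy_changed(before, after))
```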



4 GATEWAY PROCEDURES 

To maintain transparent interdomain connectivity, the gateway performs the 
following tasks: 

• Translation of the connection setup messages. 

• Translation of the traffic parameters. 

• Translation of the session directory messages. 

• Admission control. 

• Data forwarding. 

• Priority service. 

As shown in Figure 1, two daemons, the SDR daemon and the gateway dae- 
mon, are responsible for the above tasks. The SDR daemon is only responsible 
for translating session advertisement messages from one domain to the other 
one. The gateway daemon is responsible for the other tasks. 

4.1 Signaling translation 

The gateway has to translate the signaling messages and map the traffic pa- 
rameters between the two domains. In the following subsection, we consider 
the case of an IP source and an ATM receiver, and after that, the case of an 
ATM source and an IP receiver. 

IP to ATM: The gateway learns of the existence of a new session from the
session directory (SDR) protocol. It joins the multicast groups in order to
receive PATH messages for each group and hence learn of the existence of an
IP source. The gateway also uses SDR to announce the session on the ATM
network. Figure 2 shows how the gateway translates the messages to achieve
coherent end-to-end signaling.

Figure 2 End-to-end signaling: IP to ATM case

The gateway learns of a new source when it receives a new PATH mes- 
sage on the base layer multicast address. After performing local admission 
control tests, the gateway behaves like a new source on the ATM network, 
and advertises its address in an ATM_SRC message. It also adds the layer by 
creating a multicast VC (with currently no receivers) for this layer with the 
QoS translated from the PATH message, and updating the database. 

When an ATM receiver wants to receive a particular source and layer, it 
sends a JOIN request to the gateway. On receiving such a JOIN, the gateway 
looks up the corresponding IP source from the database, and makes a reser- 
vation, using a RESV message to the IP source. Finally, the gateway adds the 
receiver to its ATM multicast VC, and updates the forwarding table. 
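
The IP to ATM steps described above can be sketched as a small state holder. This is only an illustration under stated assumptions, not the authors' implementation: the class and method names, the single capacity figure used for local admission control, and the choice to send RESV only for the first receiver of a layer are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LayerState:
    qos: dict                                     # QoS translated from the PATH message
    receivers: set = field(default_factory=set)   # ATM receivers on the multicast VC
    reserved: bool = False                        # True once RESV was sent upstream


class IpToAtmGateway:
    """Hypothetical sketch of the gateway state for an IP source and ATM receivers."""

    def __init__(self, capacity_kbps):
        self.free_kbps = capacity_kbps   # assumed admission control budget
        self.layers = {}                 # (ip_source, layer) -> LayerState
        self.sent = []                   # log of outgoing signaling messages

    def on_path(self, ip_source, layer, rate_kbps):
        """Handle a PATH message: admit the layer and advertise the source."""
        key = (ip_source, layer)
        if key in self.layers:
            return True                  # refresh of a known layer
        if rate_kbps > self.free_kbps:
            return False                 # local admission control test failed
        self.free_kbps -= rate_kbps
        # The multicast VC is created with no receivers yet.
        self.layers[key] = LayerState(qos={"rate_kbps": rate_kbps})
        self.sent.append(("ATM_SRC", ip_source, layer))
        return True

    def on_join(self, atm_receiver, ip_source, layer):
        """Handle a JOIN request from an ATM receiver."""
        state = self.layers.get((ip_source, layer))
        if state is None:
            return False                 # unknown source or layer
        if not state.reserved:           # assumption: reserve once per layer
            self.sent.append(("RESV", ip_source, layer))
            state.reserved = True
        state.receivers.add(atm_receiver)
        return True
```

In this sketch the `sent` list stands in for real signaling; a second JOIN for the same layer only adds a leaf to the existing multicast VC.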

In case RSVP signaling is not present, the gateway cannot depend on PATH 
messages and NULL PATH messages to learn of the joining and leaving of IP 
sources. In this case, it joins all multicast groups, and uses the presence 
of data to learn about sources. On learning of new sources, it advertises 
them using ATM_SRC messages. It uses timers to learn about sources leaving. 
In order to assist the receiver in performing network adaptation, the 
gateway performs priority service on the layers of the same session. This is 
further described in subsection 4.3. 
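
Detecting sources from the presence of data and from timers can be sketched as a soft-state table. This is a minimal sketch under assumptions: the class name, the explicit `now` argument, and the timeout value are hypothetical and are not taken from the gateway implementation.

```python
class SourceTracker:
    """Hypothetical soft-state table of IP sources seen through data packets."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before a source is dropped
        self.last_seen = {}         # source -> time of the last data packet

    def on_data(self, source, now):
        """Record a data packet; return True if the source is new."""
        is_new = source not in self.last_seen
        self.last_seen[source] = now
        return is_new               # a new source would be advertised with ATM_SRC

    def expire(self, now):
        """Return and forget the sources that have been silent for too long."""
        gone = [s for s, t in self.last_seen.items() if now - t > self.timeout]
        for s in gone:
            del self.last_seen[s]
        return gone
```

Calling `expire` periodically plays the role of the timers mentioned above.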

ATM to IP: When a session is created by an originator on the ATM net- 
work, the gateway learns about it from an SDR message. The gateway SDR 
daemon acts as an originator on the IP side, by choosing IP multicast ad- 
dresses for the session and sending SDR messages. Figure 3 shows how the 
gateway translates the signaling messages, when the source is in the ATM 
network and the receiver is in the IP network. 

The gateway learns the existence of a new ATM source by receiving an 
ATM_SRC message. The gateway sends as many JOIN messages as layers 
to the ATM source. It translates the QoS learned from the ATM signaling, 
performs the local admission control tests, and sends as many PATH messages 
as existing layers to the corresponding multicast groups. The gateway then 
tears down the multicast VCs, since no receivers exist yet. For each ATM 
source, the gateway is seen as a different IP source, distinguished by source 
UDP port number, by the receivers in the IP domain. 

Figure 3 End-to-end signaling: ATM to IP case 

The gateway learns about the first IP receiver for a layer by receiving a 
RESV message on a multicast group. After performing local admission con- 
trol, the gateway sends a JOIN message to the ATM source requesting the 
particular layer. For subsequent IP receivers, no new state needs to be set up 
at the gateway, and the RESV message is not even forwarded to the gateway 
process by RSVP unless the reservations change. 

If the gateway does not support RSVP, the signaling is very simple. The 
gateway learns the existence of a new ATM source by receiving an ATM_SRC 
message. The gateway waits for an IP receiver to show interest by sending 
an IGMP report message. Once a receiver exists, the gateway sends a JOIN 
message to the ATM source, updates the database and forwarding table, and 
forwards the data to the IP side. The gateway also learns about receivers 
leaving from the IGMP protocol, and deletes the corresponding ATM 
connections when all receivers for a given multicast address leave. 

4.2 Data forwarding 

To forward the data streams from the senders to the correct set of receivers, 
the gateway must identify: (1) the sender, (2) the layer, and (3) the receivers. 

When a packet is received on the IP side, the packet header contains the IP 
multicast address, the UDP destination port, the IP source address, and the 
UDP source port. Based on these four fields, the gateway looks up a multicast 
VC handle from the forwarding table and transmits the data. When a packet 
is received on the ATM side, the gateway uses the multicast VC handle to 
look up the addresses to put on the outgoing packet header in the forwarding 
table. 
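
The two lookups described above can be sketched with two dictionaries. This is an illustrative sketch only: the class name, method names, and the representation of a VC handle as an integer are assumptions, not details of the gateway code.

```python
class ForwardingTable:
    """Hypothetical forwarding table keyed as described in the text."""

    def __init__(self):
        self._ip_to_vc = {}    # (group, dst_port, src_addr, src_port) -> VC handle
        self._vc_to_ip = {}    # VC handle -> header fields for the outgoing packet

    def add_ip_to_atm(self, group, dst_port, src_addr, src_port, vc_handle):
        self._ip_to_vc[(group, dst_port, src_addr, src_port)] = vc_handle

    def add_atm_to_ip(self, vc_handle, group, dst_port, src_port):
        # src_port is the UDP source port that distinguishes the ATM source.
        self._vc_to_ip[vc_handle] = (group, dst_port, src_port)

    def vc_for_packet(self, group, dst_port, src_addr, src_port):
        """IP side: map the four header fields to a multicast VC handle."""
        return self._ip_to_vc.get((group, dst_port, src_addr, src_port))

    def header_for_vc(self, vc_handle):
        """ATM side: map the VC handle to the addresses for the outgoing header."""
        return self._vc_to_ip.get(vc_handle)
```

A missing entry returns `None`, which a caller could treat as "no receiver has joined this layer yet".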
Note that the forwarding state of the gateway is extremely simple. Packets 
are actually queued in the receive socket buffers (in the kernel). The gateway 
daemon processes packets one at a time, reading from one socket and imme- 
diately transmitting to another. In order to implement priority service, it just 
associates a priority with each socket, and serves them in priority order. In 
order to implement controlled discarding (such as RED In Out (RIO)), it uses 
IOCTL calls to read the socket buffer length. This results in a very simple 
and robust gateway design. 
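
The priority service loop can be sketched as follows. This is a sketch under assumptions: in-memory deques stand in for the kernel socket buffers, `backlog` stands in for the IOCTL call that reads the socket buffer length (it counts packets here, not bytes), and all names are hypothetical.

```python
from collections import deque


class PriorityServer:
    """Hypothetical model of serving per-socket buffers in priority order."""

    def __init__(self):
        self._buffers = {}                       # socket name -> (priority, deque)

    def add_socket(self, name, priority):
        # Lower number means higher priority, e.g. the layer number.
        self._buffers[name] = (priority, deque())

    def enqueue(self, name, packet):
        self._buffers[name][1].append(packet)

    def backlog(self, name):
        """Queue length, as a controlled-discard policy might inspect it."""
        return len(self._buffers[name][1])

    def serve_one(self):
        """Forward one packet from the highest-priority non-empty buffer."""
        ready = [(prio, name) for name, (prio, buf) in self._buffers.items() if buf]
        if not ready:
            return None
        _, name = min(ready)
        return name, self._buffers[name][1].popleft()
```

Because only one packet is handled per call, a newly arrived base-layer packet is always served before queued packets of higher layers.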

4.3 Participation in adaptation procedures 

Where the gateway acts as a receiver into a network, either ATM or IP, it 
ensures that it adds and drops layers in order of the layer number. This func- 
tionality is redundant, since the receivers exhibit this property individually. 
However, it helps to stabilize the system against the effect of accidental use 
of the multicast address space of a layered session by an unrelated receiver, 
or against incorrect behavior on the part of layered receivers. 

The gateway is responsible for forwarding feedback messages from the re- 
ceivers to the originator of the session across the ATM/IP boundary. This 
allows the originator to get the full picture about the set of receiver and net- 
work capacities, so it can make its decision about changing the transmitted 
hierarchy. 

When the originator decides to change the layer hierarchy, it transmits a 
new SDR message. These messages are translated and forwarded from one 
network to the other by the gateway. This allows sources on both sides of the 
network to learn about the new hierarchy. The sources then start to transmit 
data according to the new hierarchy. On the IP network, they start sending 
the data according to the new scheme right away, and also transmit PATH 
messages to notify the network and the receivers about the changed resources 
needed. If resources are available, the reservations are modified by RESV mes- 
sages from the receivers without a need to tear down the current distribution 
tree. On the ATM network, the sources tear down the current distribution 
tree and wait for new requests to join from the receivers. The new multicast 
VCs are set up according to the modified bitrate requirements. 

The gateway participates in this adjustment of the distribution trees. The 
signaling actions taken by the gateway as part of the adaptation process are 
all the result of messages from the network sent by the receivers (e.g., RSVP 
RESV messages or ATM JOIN request), or the senders (e.g., RSVP PATH 
messages or ATM_SRC messages). The gateway never initiates any adaptive 
action, therefore its state is simple. For example, no timers need to be kept. 

However, in the absence of RSVP, the gateway must detect the coming and 
going of IP sources by using timers, since no explicit notification is provided by 
the network. The gateway also assists the adaptation process at the receiver 
by performing priority service of packets, based on layer number. The receiver 
performs loss based load shedding similar to RLM in the absence of RSVP. 
Priority service concentrates the loss onto the highest layers in the case of 
congestion. This has two advantages. Firstly, the receiver can compute the 
loss rate for each layer separately, and this will be much higher than if the 
same loss was spread across all the layers. This gives a clearer signal to the 
receiver to drop the highest layer. For instance, the drop trigger thresholds can 
be set higher, so a few accidental packet drops do not cause the receiver to back 
off. This makes the adaptation more stable while still remaining responsive. 
The second advantage is that if the loss is concentrated on the highest layer, 
the visual quality of the video is less affected than if the same loss is spread 
across all the layers. 
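
A small numerical sketch illustrates why concentrating the loss gives a clearer signal. The layer rates and the number of dropped packets below are made-up figures chosen for the example, and the function name is hypothetical.

```python
def per_layer_loss(rates, dropped, priority_service):
    """Return the loss fraction seen on each layer (index 0 is the base layer).

    rates: packets per second offered on each layer.
    dropped: packets per second that the bottleneck must discard.
    """
    if not priority_service:
        # Loss spread evenly: every layer sees the same loss fraction.
        total = sum(rates)
        return [dropped / total for _ in rates]
    # Priority service: discard from the highest layer downwards.
    losses = [0.0] * len(rates)
    remaining = dropped
    for i in reversed(range(len(rates))):
        lost = min(rates[i], remaining)
        losses[i] = lost / rates[i]
        remaining -= lost
    return losses


spread = per_layer_loss([100, 100, 100, 100], 40, priority_service=False)
focused = per_layer_loss([100, 100, 100, 100], 40, priority_service=True)
```

With these figures every layer loses 10% of its packets in the spread case, while with priority service the top layer loses 40% and the lower layers lose nothing, so a drop threshold well above 10% can be used.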



5 IMPLEMENTATION STATUS AND PERFORMANCE 

The current implementation of the gateway at the user level has several limi- 
tations. The number of sessions that can be simultaneously handled is severely 
limited by the number of sockets available to a single UNIX process. The gate- 
way has also not been optimized for high throughput. Finally, the support for 
local admission control of the gateway resources and controlled dropping of 
low-priority packets under overload is not complete. The version of the gate- 
way that we used for the measurements implements simple priority service, 
with the priority based on layer number and QoS support. 

The gateway runs on top of the socket interface, and binds virtual circuits 
on the ATM network and multicast groups on the IP network to sockets. 
Thus, for example, for a single session with seven layers, one ATM sender, 
one IP sender, and an arbitrary number of ATM and IP receivers, the gateway 
daemon uses twenty three sockets. Since a UNIX process has access to a 
maximum of 64 sockets, this limits rather severely the size and number of the 
sessions we can create. One possible solution to this problem lies in moving 
the gateway implementation into the kernel. This would reduce the memory 
copy overhead as well, leading to an improvement in the maximum throughput 
capacity. This may be appropriate for a stand alone machine, with the sole 
function of connecting layered applications across ATM and IP. 

The following experiments were conducted on a network testbed consisting 
of an ATM LAN and a 10 Base-T Ethernet LAN. The ATM LAN consists 
of a Fore ASX-200 WG switch, with multiple UltraSparc workstations con- 
nected using Fore SBA200 ATM interface cards, and running ForeThought 
4.1.0 driver software that implements an application programming interface 
to the ATM UNI 3.0. The machines on the Ethernet LAN run the ISI imple- 
mentation of RSVP on Solaris 2.5.1. The ATM and Ethernet LANs are con- 
nected by a Sparc20 workstation running Solaris 2.5.1 with ATM and Ethernet 
network interface cards. This workstation runs the gateway software. 

Our implementation is not optimized for fast forwarding performance. For 
example, it would be possible to move the forwarding function into the kernel 
to lower memory copy overheads and improve throughput. Hence, the first 
experiment we performed is to test the maximum data forwarding capacity 
of the gateway. We found that even with multiple sessions running simultane- 
ously, the gateway is capable of forwarding data up to the full capacity of the 
Ethernet without overload. The signaling performance is also not degraded at 
this high load, indicating that the gateway has sufficient CPU cycles to spare. 

The next set of experiments explores the time to add a layer. Our 
receivers use signaling to probe the network for available capacity. In the worst 
case, this requires one round trip for each layer; the latency of this process is 
of concern. 

Figure 4 shows the round trip latency to add a layer with a single source 
on the ATM network, and a single receiver on the IP network. The round trip 
is measured from when the RESV message is transmitted by the receiver, to 
when the first data packet arrives at the receiver. It is important to note that 
the case of a single receiver is the worst, since it requires a full round trip 
for each layer. A second receiver on the Ethernet would observe much less 
latency; it would start receiving the data as soon as it bound a socket to the 
multicast address, as the data is already being transmitted on the LAN. 

The X-axis shows the layer number N, while the Y-axis shows the time 
to add the Nth layer averaged over eighty repetitions of the experiment, 
and the 95% confidence intervals. We note that the average time to join a 
layer increases with the number of layers being added. Figure 5 shows the 
IP to ATM case. Table 1 shows the breakup of the round trip time into its 
major components, shown as a (MAX, AVERAGE, MIN) triple in units of 
milliseconds. These components are: 

• RSVP processing: the delay in the receiver host from when the decoder 
decides to add a layer to when the RESV message is sent. This processing 
is performed on the receiver, by the RSVP library and daemon code. The 
processing time of layer 1 is greatest, because an RSVP thread is created. 

• RESV processing: the delay in the gateway machine from when the gateway 
receives a RESV message to the time the multicast VC is set up. This 
involves setting up the internal tables, sending a JOIN message to the 
ATM source, and then performing ATM UNI signaling. 

• Data sent: the delay from the time the VC is set up to the time the first 
packet arrives at the gateway. A major component of this delay is the time 
spent waiting for the next round of transmission. In addition, for layer 1 
the source has to wake up a thread before data is transmitted. 

• Forwarding: the delay in the gateway from when the data is read to when 
the data is transmitted. This is not shown in the table since it is almost 
fixed at 1 ms. 

• Data read: the delay at the receiver, from when the data arrives in the 
kernel buffer to when the application reads it from the kernel buffer. 

Figure 4 Round trip latency for adding a layer: ATM to IP 

Figure 5 Round trip latency for adding a layer: IP to ATM 

Table 1 Break up of time to add a layer; ATM to IP case 

Columns: Layer, RSVP proc., RESV proc., data sent 

(36/43/140), (170/373/382), (0.2/0.4/0.7), (3/5/7), (308/373/402), (3/5/7), 
(55/115/196), (2/260/312), (3/5/15), (35/72/123), (238/358/385), (2/317/331) 

Figure 6 Gateway queue size 

We note that the major components of the delay, as well as of the variability, 
are the time to send the data from the source and the time to receive the data 
at the receiver. At the source, this time is the time from when the ATM 
connection has been successfully set up to when the next frame of data is 
due to be transmitted. Since the encoder places 4 frames of data into a single 
packet and the frame rate for this experiment was ten frames per second, the 
inter-transmission time is 0.4 seconds. We see that the send time is spread 
between 0 and 0.4 seconds as expected. At the receiver end, the time to 
receive the packet is similarly affected by the time to process a frame, 
since while the decoder is processing the previous frame it does not check if 
new data has arrived. When the decoding threads become idle, the thread 
which is blocked on the receive gets a chance to run and retrieve the data 
from the network. This time increases with layer number and saturates near 
0.4 seconds, since the receiver does not take on more layers than it can 
handle. Thus, the major element affecting the time to add a layer is the unit 
of transmission, which in this case happens to be four video frames. 
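
The 0.4 second figure follows directly from the packetization. The short sketch below reproduces the arithmetic; the mean wait of half a period is an added assumption (that the VC setup completes at a uniformly random point within the transmission period), and the function names are hypothetical.

```python
def inter_transmission_time(frames_per_packet, frame_rate_hz):
    """Seconds between two packet transmissions of the encoder."""
    return frames_per_packet / frame_rate_hz


def mean_wait(period):
    """Average wait for the next transmission, assuming a uniform arrival phase."""
    return period / 2.0


period = inter_transmission_time(4, 10)   # 4 frames per packet at 10 frames/s
wait = mean_wait(period)
```

Under that assumption the send time would range between 0 and the period, which matches the observed spread between 0 and 0.4 seconds.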

In order to maintain stability, the receiver has to perform some measure- 
ments of the last layer added before it can safely add the next layer. The inter- 
vals of time for this measurement increase as the receiver becomes loaded, to 
avoid oscillatory behavior. These intervals are of the order of seconds, which is 
quite large compared to the round trip time to add a layer. Thus, the signal- 
ing latencies are acceptable for the applications under study, and optimizing 
the signaling overhead by using ‘bundled VCs’ as proposed in HMC is not 
necessary. 
Figure 6 shows the queue lengths for two simultaneous streams. The foreground 
stream is a 2.1 Mbps stream, carried with QoS support, using ATM 
and RSVP signaling. The background stream is a best effort stream with bi- 
trate increasing from zero to nine megabits per second. The X-axis shows the 
total bitrate of the combined traffic, while the Y-axis shows the queue length 
for the foreground queue (diamond symbols) and total queue length (plus 
symbols). We note that the foreground queue remains small, even when the sum 
of bitrates exceeds the capacity of the Ethernet (10 Mbps). This happens for 
two reasons. Firstly, since the foreground traffic is shaped at the entrance to 
the ATM network, a burst of traffic is never injected in this stream. Secondly, 
the QoS stream is protected from the bursty behavior of the best effort back- 
ground stream by the per stream queueing and priority service in the gateway 
(as well as in the switches, etc). This experiment shows the effectiveness of 
the priority service and QoS mechanisms; even under overload conditions, all 
the queueing delay and loss are concentrated on the low priority stream. 



6 CONCLUSION 

We have presented an implementation of a gateway for connecting layered 
applications across ATM and IP networks. This implementation improves on 
previous research by extending the feedback algorithms for adaptation all the way 
back to the source. This allows the source to select the correct number of lay- 
ers, and the bitrates for each layer, to accommodate the current network and 
receiver capacities. Our adaptation model has three different control loops, 
one limited to the receiver, a longer one involving the receiver and the net- 
work, and a third (longest) control loop involving the source, network and 
receiver. The combination of the three allows the feedback to be 
scalable, stable, and still responsive. The gateway participates in the source 
adaptation by translating the feedback messages, and updating the layered 
hierarchy advertisements (in the session directory protocol) when they change. 

We also extend previous research by considering the addressing and naming 
translation issue. The extension of the session directory protocol to the ATM 
environment allows us to compensate for the lack of a multipoint-to-multipoint 
abstraction on the ATM network, since the receivers can find out about source 
information from the session directory. The gateway participates in the session 
directory protocol, to become aware of new sessions and sources, to advertise 
them on the other side, and to translate the addresses, port numbers and 
other information that the receivers need to join a session. The gateway acts 
as a proxy for each IP source on the ATM network, and acts on behalf of 
IP receivers on the ATM side. Instead of using preconfigured tables for the 
address translation, the gateway exchanges the necessary information through 
the session directory protocol. 

In another departure from previous work, our applications take care of the 
complexity of layering, such as ensuring that resources are not wasted for 
higher layers when lower layers are not available, at the edge of the network. 
This simplifies the network from the point of view of network scalability. It 
also allows us to perform our experiments with a standard ATM and RSVP 
installation. Our experiments show that the signaling is not a major factor in 
the latency of the receiver based adaptive control. 

We deal with the case when RSVP is not present, by using a loss based 
mechanism similar to RLM. In this case, the receiver responds to congestion 
by detecting increased packet loss and dropping layers. Since the gateway is 
itself likely to be a bottleneck (going from a high speed ATM to a low speed 
Ethernet network), we concentrate loss at the gateway on to the highest layers 
by performing priority service. This gives the clearest feedback to the receivers, 
since the percentage of lost packets on the highest layer is maximized. It also 
minimizes the effect of the loss on the visual quality of the video. Finally, this 
action, being taken by an application level entity, does not cause any increase 
in the network complexity or a violation of layering. 

In conclusion, we consider all aspects of the layered multicast problem at 
the ATM/IP gateway. We contend that this makes our system more usable, 
more complete, more flexible, and more stable than previous prototypes of 
layered multicasting. 
Abstract 

This paper considers a scenario where the traffic of several LANs is transported on 
Deterministic Bit Rate (DBR) or Statistical Bit Rate (SBR) ATM Virtual Channel 
Connections (VCCs), that are then multiplexed into a DBR Virtual Path 
Connection (VPC) with fixed, dedicated bandwidth. It is investigated whether or 
not it is suitable to shape VCCs according to a DBR or SBR traffic contract before 
multiplexing. Results show that DBR shaping is rather useless, as with respect to 
the unshaped case no significant utilisation gain can be achieved without 
introducing high delays in the shapers’ buffers, and that SBR shaping behaves no 
better, due to the impossibility of finding a typical burst duration and mapping it 
on SBR traffic descriptors. 
1 INTRODUCTION 

When talking about ATM technology, its ability to differentiate the Quality of 
Service (QoS) of the carried traffic streams according to their needs is often 
mentioned as an attractive characteristic. Real time applications requiring 
stringent end to end information transfer delays and delay variations can be 
carried with the higher QoS class, which in the ITU-T terminology is called 
“QoS class 1” (ITU-T I.356, 1996). Applications more tolerant to delays, like 
data applications, can be carried either with QoS class 2 or QoS class U. QoS 
class 2 means that the Cell Loss Ratio (CLR) is guaranteed to be lower than a 
certain bound, whereas QoS class U means that no guarantee at all is given, 
neither on losses nor on delays. Although data applications always have the 
ability to detect and recover packet losses through frame retransmission, many 
simulation studies have shown how dramatic the effect of unbounded cell losses 
during congestion periods can be in terms of increased end to end delays 
(Bonaventure, 1997). In spite of being in principle “tolerant” to application 
delays, users of applications like Telnet, FTP, Web browsing, etc., would 
greatly appreciate the performance improvement coming from limited cell losses 
in their data. 

Next to the ability of ATM to differentiate traffic into QoS classes, its 
scalability is worth recalling, i.e. its suitability to be used both as a LAN and as a WAN 
technology. In recent years, a lot of ATM LANs have been deployed and they are 
nowadays being used with success. Also, trials and experiments to deploy and 
operate ATM in the WAN environment have been performed, and ATM backbone 
networks are now a reality. 

In parallel, especially due to the booming growth of the Internet, the IP 
protocol has consolidated its position, and legacy LAN technologies like 
Ethernet have been significantly improved (100 Mbit/s Ethernet switches being 
already widely deployed and Gigabit Ethernet being right round the corner). 

In summary, ATM can be successfully used as a backbone technology also to carry 
data traffic relying on the IP protocol and generated on non ATM LANs. The 
bursty nature of this kind of traffic gives the public carriers the opportunity 
to achieve some statistical gain (or “multiplexing gain”), but the need to do 
that while meeting some QoS contract raises a lot of traffic engineering 
issues, and this paper addresses some of them. 

The paper is organised as follows: in section 2 we describe how we performed 
some traffic measurements over CSELT’s LANs in order to verify some 
characteristics of LAN traffic (Self Similarity) that had already been described in 
literature (Leland, 1994) and to obtain the values needed to parameterise the source 
traffic models we used in the study. Such models are described in section 3. In 
section 4 we describe the simulation scenario we implemented, while in section 5 
we focus on the effectiveness of the traffic shaping as a way to meet QoS 
commitments. Finally, in section 6 we present some conclusions and outline 
future work. 
2 TRAFFIC CHARACTERISATION 

A thorough characterisation of the traffic generated by IP applications over 
today’s LANs is of great importance when performing internetworking studies. 
After the pioneering work at Bellcore (Leland, 1991), which put into evidence 
the Self Similar and Long Range Dependence (LRD) characteristics of this type 
of traffic, a lot of effort all over the world was devoted to faithfully 
reproducing these characteristics by means of models more complicated than the 
traditional Markov ones. A good review of these traffic characteristics and 
proposed models can be found in (Morin, 1996). 

Usually, the path followed by researchers in this area is to perform some 
measurement on real networks, verify Self Similar characteristics, provide an 
estimation of the Hurst parameter and of other parameters of interest and then use 
those values in some traffic models to show how good they are in reproducing 
the statistical characteristics of real traffic traces and/or their queuing behaviour. 
This is what we did in our study too, but we were less concerned about comparing 
performances of “advanced models”. This is not because we believe we used the best 
possible traffic model, but because we are aware that however thorough the 
characterisation of some measured traffic is, it is probably very closely related to 
the network technology and to the applications and protocols used over it. This 
danger is very well explained in an article by Paxon and Floyd (Paxon, 1997), 
which also suggests the wide variation of parameter values used in the models as a 
method to extend the generality of internetworking studies performed. This was 
the approach followed in this study. 

For the sake of completeness, however, we briefly describe how we collected and 
analysed measurements on some of CSELT’s LANs. 

We focused on a 10 Mbit/s Ethernet segment collecting traffic from several hosts 
(mainly PCs with Windows 95 and Unix workstations, mostly running ordinary 
network applications such as e-mail clients, FTP and Web clients, telnet, Xwin, 
Sun NFS). Measurements were collected by a Sun Ultra 1 Workstation with a 200 
MHz processor, with the aid of a modification of the freeware “tcpdump”. The 
main modification was that no per-packet information was stored on the disk but 
only, at fixed time intervals, the summary information about the number of 
Ethernet bytes seen on the segment. The time interval duration was one second, 
and measurements were repeatedly collected from 9.00 to 17.00, for sixteen 
working days. A comparison of our data with data collected in parallel with the 
aid of a Wandel & Goltermann DA-30 protocol analyser showed that packet losses 
by the tcpdump modification were limited, and estimated statistical parameters 
were not significantly affected. 

Unfortunately, the 10 Mbit/s Ethernet segment under measurement only collected a 
limited amount (say, less than !4) of the Intranet and Internet traffic 
generated/received by the 200 researchers hosted in our building. As a result, the 
sixteen daily collected traffic profiles often showed some evident nonstationarity. 
In order to increase stationarity, we superposed them four by four, thus obtaining 
four aggregate profiles, being more stationary and potentially more representative 
of the traffic generated/received by all the researchers of our building. Of course 
such an operation was possible only because we performed load measurements, 
and we didn’t compute any packet interarrival time statistic. 

The four “aggregated” profiles were then analysed in order to compute some 
statistical parameters. Among them, the most relevant to this study were: 

• the mean rate (in bytes/s); 

• the peak factor, i.e. the ratio of the variance of the number of bytes to the 
mean of the number of bytes seen at each one second time interval; 

• an estimation of the Hurst parameter based on the Index of Dispersion for 
Counts (IDC). See (Gusella, 1990). 
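
The three statistics listed above can be sketched with the standard library only. This is a hedged sketch, not the authors' analysis code: the function names are hypothetical, the peak factor follows the variance-to-mean definition given in the text, and the Hurst estimate assumes the asymptotic relation IDC(t) proportional to t^(2H-1), fitted by least squares on a log-log scale.

```python
import math
from statistics import mean, pvariance


def mean_rate(counts):
    """Mean number of bytes per one second interval."""
    return mean(counts)


def peak_factor(counts):
    """Ratio of the variance to the mean of the per-interval byte counts."""
    return pvariance(counts) / mean(counts)


def idc(counts, t):
    """Index of Dispersion for Counts over non-overlapping windows of t intervals."""
    blocks = [sum(counts[i:i + t]) for i in range(0, len(counts) - t + 1, t)]
    return pvariance(blocks) / mean(blocks)


def hurst_from_idc(counts, scales):
    """Estimate H from the slope of log IDC(t) versus log t: H = (slope + 1) / 2."""
    xs = [math.log(t) for t in scales]
    ys = [math.log(idc(counts, t)) for t in scales]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return (slope + 1.0) / 2.0
```

For uncorrelated counts the IDC is flat and the estimate is close to 0.5; values approaching 1 indicate long range dependence. Four daily profiles can be superposed before the analysis by summing them element by element.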

We then chose one of the profiles as being the most representative one, and used 
the computed values as a starting point to parameterise our model, whose 
description is in the following section. 



3 TRAFFIC GENERATION MODEL DESCRIPTION 
We already pointed out that in order to draw conclusions not limited to the traffic 
characteristics of a particular LAN, we preferred not to directly perform 
simulations with measured traffic traces. Instead, we used a traffic model initially 
parameterised on the basis of measurements and then varied its parameters. 

The model we implemented belongs to that category of bursty fluid models that try 
to achieve Self Similarity by aggregation of ON OFF sources with heavy tailed 
distributed on and/or off periods, as explained in (Morin, 1996) or in (Willinger, 
1995). In particular, our model output consists of the rate generated by sources 
that become active according to a Poisson process with parameter λ, generate 
traffic at a fixed rate R while active, and have an active state duration h that 
is heavy tailed distributed (see Figure 1). We will refer to that model as the Poissonian Arrival of 
Bursts (PAB) model. References to it can also be found in (Roberts, 1997). 
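
A generator in the spirit of the PAB model can be sketched as follows. Several points are assumptions made only for this example: burst durations are drawn from a Pareto law truncated to a minimum and a maximum duration (`eps` and `tc`), with shape 3 - 2H, which is a common choice for obtaining a Hurst parameter H from heavy tailed bursts; all function and parameter names are hypothetical.

```python
import random


def truncated_pareto(rng, alpha, eps, tc):
    """Draw a duration in [eps, tc) by inverting the truncated Pareto CDF."""
    u = rng.random()
    return eps * (1.0 - u * (1.0 - (eps / tc) ** alpha)) ** (-1.0 / alpha)


def pab_bursts(rng, lam, hurst, eps, tc, horizon):
    """Return (start, duration) pairs; burst starts form a Poisson process."""
    alpha = 3.0 - 2.0 * hurst          # assumed tail index, see lead-in
    t, bursts = 0.0, []
    while True:
        t += rng.expovariate(lam)      # exponential gaps between burst arrivals
        if t >= horizon:
            return bursts
        bursts.append((t, truncated_pareto(rng, alpha, eps, tc)))


def aggregate_rate(bursts, rate, horizon, slot=1.0):
    """Average aggregate rate in each slot when every active burst sends at `rate`."""
    n = int(horizon / slot)
    load = [0.0] * n
    for start, dur in bursts:
        end = min(start + dur, horizon)
        k = int(start / slot)
        while k < n and k * slot < end:
            overlap = min(end, (k + 1) * slot) - max(start, k * slot)
            load[k] += rate * overlap / slot
            k += 1
    return load
```

A call such as `aggregate_rate(pab_bursts(random.Random(1), 5.0, 0.8, 0.01, 100.0, 200.0), 1.0, 200.0)` yields 200 one second samples of the aggregate rate.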

Instead of investigating “a priori” whether the infinite source approximation were 
valid or not to correctly reproduce the traffic generated by a finite (and indeed 
rather limited) number of users/applications, we preferred to compare the queuing 
behaviour of real traces and simulated traces, usually finding a good match. 
Figure 1 - Traffic generation according to the PAB model. 

The exact probability density function of the burst duration h is reported in 
equation (1), where H is the model’s resulting Hurst parameter and Tc and ε have 
the meaning of maximum and minimum burst duration, respectively. 

Due to the presence of a nonzero ε and of a non infinite Tc, the exact expression of 
the Index of Dispersion for Counts (IDC) for the model generated traffic drifts 
from the ideal one, which would be as reported in (2) and corresponds to the IDC 
of asymptotically Self Similar traffic. For any nonzero ε and finite Tc, the drift 
becomes more and more evident as t approaches zero or tends to infinity. 

IDC(t) = K t^{2H-1}   (where K is a constant)   (2) 

The model is thus characterised by five parameters: λ, R, H, ε and Tc. We fixed 
the first three with the aid of three equations (not reported here) and of the three 
parameters extracted from measurements (mean rate, peak factor and Hurst 
parameter, see section 2), whereas Tc and ε can be considered as “degrees of 
freedom” of the model. In finite time simulations, they must be set to values 
different from infinity and zero, respectively. We developed an algorithm that, 
given an interval [t1, t2] over which a “small drift” of the real IDC from the ideal 
expression reported in (2) is allowed, finds the best ε and Tc choices for a given 
maximum simulation time. In our study, we always required a “small drift” within 
the time range [0.1s, 50s]. The model is then Self Similar and highly 
correlated (i.e. “Long Range Dependent” - LRD) at least within that range. In 
(Paxon, 1997), as an example, it is recalled that long term correlations of LAN 
traffic have frequently been observed “from hundredths of milliseconds to tens of 
minutes”. We are not so far from this range, as even if beyond 50s the IDC curve 
starts to drift from the ideal one, the traffic remains correlated well above this 
value. 



4 THE SIMULATION SCENARIO 

In the following we assume that the reader is familiar with these concepts: ATM 
Virtual Path (VP) and Virtual Channel (VC) connections, ATM traffic contracts, 
Deterministic Bit Rate (DBR) transfer capability, Statistical Bit Rate (SBR) 
transfer capability, Peak Cell Rate (PCR), Sustainable Cell Rate (SCR), Maximum 
Burst Size (MBS). Definitions can be found in ITU-T recommendation I.371 
(ITU-T I.371, 1996) or in ATM Forum traffic management specification 4.0 
(ATMF TM 4.0, 1996). 

To address some of the traffic engineering issues outlined in the introduction, we 
simulated a simple scenario, namely the multiplexing of several VC connections 
into a single DBR VP connection. This may occur, for example, on an output port 
of an ATM switch that collects traffic from an edge ATM network or directly from 
ATM cards of upstream IP routers. The DBR VP connection is assigned a fixed 
amount of the bandwidth of the physical link, as well as a fixed amount of buffer 
space. 

All the simulations presented in this study were performed at the fluid level, i.e. 
traffic sources produce as output a sequence of pairs of the form (Time Interval, 
Rate during Time Interval). The queues do not receive cells, but only the 
information about the intensity of an incoming workload and the duration of this 
intensity. Figure 2 summarises the simulation scenario we considered. 
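As an illustration of what a fluid-level queue looks like, the following is a minimal sketch of a finite-buffer multiplexer fed with an aggregate rate sequence. It is not the simulator used in this study: the function name `fluid_mux_loss`, its parameters, and the assumption that the input rate is constant within each fixed-length slot are all hypothetical choices made for the example.

```python
def fluid_mux_loss(agg_rates, link_rate, buffer_size, dt=0.1):
    """Loss ratio of a finite-buffer fluid queue (illustrative sketch).

    agg_rates   -- aggregate input rate in each slot (bytes/s), assumed constant per slot
    link_rate   -- service rate of the queue (bytes/s)
    buffer_size -- buffer space (bytes)
    dt          -- slot duration (s)
    """
    backlog = 0.0   # work currently stored in the buffer (bytes)
    lost = 0.0      # work dropped because the buffer was full (bytes)
    offered = 0.0   # total work offered to the queue (bytes)
    for rate in agg_rates:
        offered += rate * dt
        # Net change of the backlog over the slot: arrivals minus service.
        backlog += (rate - link_rate) * dt
        if backlog < 0.0:
            backlog = 0.0                  # the queue cannot go negative
        if backlog > buffer_size:
            lost += backlog - buffer_size  # overflow is lost
            backlog = buffer_size
    return lost / offered if offered > 0.0 else 0.0


if __name__ == "__main__":
    # Two 1 s slots: a burst at 300 bytes/s, then silence.
    print(fluid_mux_loss([300.0, 0.0], link_rate=100.0, buffer_size=50.0, dt=1.0))
```

No cell is ever represented: the queue only sees how much work arrives in each interval, which is what makes fluid simulation much faster than cell-level simulation.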

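[Figure 2 - Simulation scenario: VC connections, each optionally passing through a shaper (DBR or SBR), multiplexed into the DBR VP] 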




Each VC connection has a traffic contract that can be either DBR or SBR, and a 
PAB model as described in section 3 generates the source traffic. 

In this study we considered two cases: 

• each VC connection can access all the buffer space and bandwidth reserved for 
the DBR VP (unshaped case), i.e. its traffic contract is either not controlled or 
traffic contract parameters are set to values that prevent shaping devices from 
performing significant actions on the incoming flow (e.g. if the traffic contract 
is DBR, the PCR of the connection is set equal to the line rate). 

• each VC connection, before accessing shared VP resources, is shaped 
according to a traffic contract, which can be either DBR or SBR. 

In the DBR case, a “fluid” shaping device can be thought of as a queue served at the 
Peak Cell Rate (PCR) of the connection. In the SBR case, it can be thought of as a 
queue that is served either at PCR or at a lower rate, the Sustainable Cell Rate (SCR) of 
the connection: the service rate is PCR as long as a token pool of size Maximum 
Burst Size (MBS) is nonempty, SCR otherwise. The token pool size is initially set 
to MBS, and it increases (or decreases) at a rate equal to SCR minus the current rate 
of incoming traffic. The pool size always remains within the range [0, MBS]. The size of the 
buffer in the shaping devices is set to infinite, i.e. no losses can ever occur in them. 
In both the unshaped and shaped cases the QoS class of the VC connections is QoS 
class 2, i.e. there is a commitment to a Cell Loss Ratio lower than a certain bound 
for each connection. In the following we suppose that, due to the complete sharing 
of VP resources, achieving this commitment at the VP level (i.e. in the VP buffer) 
is equivalent to achieving it for each single connection. As VC shapers’ buffers, when 
present, are of infinite size, the VP buffer is the only point along the connections 
where overflow can occur. 
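The DBR and SBR fluid shapers described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the simulations: the function names, the use of bytes and bytes/s as units, and the slot-level approximation (the service rate is chosen once per slot from the token pool state at the start of the slot) are all hypothetical.

```python
def dbr_shaper(rates, pcr, dt=0.1):
    """Fluid DBR shaper: an infinite queue served at most at PCR."""
    backlog = 0.0
    out = []
    for rate in rates:
        work = backlog + rate * dt        # work available in this slot
        served = min(work, pcr * dt)      # output is capped at PCR
        backlog = work - served           # excess work is buffered, never lost
        out.append(served / dt)
    return out, backlog


def sbr_shaper(rates, pcr, scr, mbs, dt=0.1):
    """Fluid SBR shaper: served at PCR while the token pool is nonempty, else at SCR."""
    backlog = 0.0
    pool = mbs                            # the pool starts full
    out = []
    for rate in rates:
        service = pcr if pool > 0.0 else scr
        work = backlog + rate * dt
        served = min(work, service * dt)
        backlog = work - served
        # The pool drifts at (SCR - incoming rate) and stays within [0, MBS].
        pool = min(mbs, max(0.0, pool + (scr - rate) * dt))
        out.append(served / dt)
    return out, backlog


if __name__ == "__main__":
    print(dbr_shaper([200.0, 0.0], pcr=100.0, dt=1.0))
    print(sbr_shaper([300.0, 300.0, 0.0, 0.0], pcr=200.0, scr=100.0, mbs=150.0, dt=1.0))
```

In the SBR example the first slot is served at PCR, the pool is then exhausted and the next two slots are served at SCR, and the residual backlog is drained at PCR once the pool has refilled.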

As pointed out in the introduction, although there is no explicit commitment for 
delays in QoS class 2, network engineering should not allow delays to become 
intolerably high. In our simplified scenario, delays can occur both in the VP buffer 
and, if shaping is performed, in VC shapers’ buffers. Therefore, the delay statistic 
we consider will be the sum of two terms: the mean delay encountered in the VP 
buffer and the average of the mean delays encountered in the VC shapers (if any). 

Instead of taking the pure output of the PAB models, to speed up the simulations 
we slotted it into fixed intervals of duration 0.1s. The finest time dynamics we 
will be able to observe is thus limited to this value. Time dynamics below 0.1s 
would in any case have been filtered out by the buffer size of the multiplexer, which 
was chosen to be considerably large (see later). 
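The slotting step can be illustrated with a short sketch that converts a sequence of (duration, rate) pairs into mean rates over fixed-length slots. The function name `slot_fluid` and the handling of the last, possibly incomplete slot (the missing time is treated as silence) are assumptions made for this example only.

```python
def slot_fluid(pairs, dt=0.1):
    """Average a fluid trace of (duration_s, rate) pairs over slots of length dt."""
    eps = 1e-12     # tolerance against floating-point residue
    slots = []
    filled = 0.0    # time already accounted for in the current slot
    work = 0.0      # work accumulated in the current slot
    for duration, rate in pairs:
        remaining = duration
        while remaining > eps:
            step = min(remaining, dt - filled)   # do not cross the slot boundary
            work += rate * step
            filled += step
            remaining -= step
            if dt - filled <= eps:               # slot complete: emit its mean rate
                slots.append(work / dt)
                filled, work = 0.0, 0.0
    if filled > 0.0:
        slots.append(work / dt)                  # last partial slot, rest assumed silent
    return slots


if __name__ == "__main__":
    # A 0.5 s burst at 100 bytes/s followed by 1.5 s of silence, in 1 s slots.
    print(slot_fluid([(0.5, 100.0), (1.5, 0.0)], dt=1.0))
```

Any rate change shorter than the slot duration is averaged away, which is why dynamics below the slot length cannot be observed after slotting.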

The sources we used are five different parameterisations of the PAB model 
presented in section 3. Table 1 summarises the main parameters for each one of 
them. 



Table 1 - Parameters of the Poissonian Arrival of Bursts (PAB) model used as 
traffic sources 





                      Mean (byte/s)    Peak Factor (bytes/0.1s)    H 
Parameterisation 1    682000           21655                       0.8 
Parameterisation 2    682000           21655                       0.9 
Parameterisation 3    682000           21655                       0.7 
Parameterisation 4    682000           43310                       0.8 
Parameterisation 5    682000           10827                       0.8 

In particular, the first parameter set was derived from the analysis of measurements 
(see section 2). The others are one-parameter variations of the first, used to study 
what happens with increased autocorrelation (parameterisation 2), reduced 
autocorrelation (parameterisation 3), increased burstiness (parameterisation 4) and 
reduced burstiness (parameterisation 5). The variation in the autocorrelation was 
obtained by varying the Hurst parameter (the higher it is, the more autocorrelated 
the traffic), while the variation in the burstiness was obtained by varying the peak 
factor (the higher it is, the more bursty the traffic). 

In all the performed simulations, all the multiplexed sources belonged to the same 
parameterisation, i.e. we did not consider the case of mixed types of traffic sources. 
Note that the value of the Peak Factor at 0.1s did not come directly from the 
analysis of measurements (which for technical reasons were taken with a period of 
1s), but was extrapolated from the value computed at 1s, as described 
below. 

If the traffic has Self Similar characteristics over a certain timescale range [t1, t2], 
then its index of dispersion for counts, over this range, has the expression reported 
in (2). The value of the Peak Factor at time t corresponds to IDC(t). In our 1Hz 
frequency measurements we could verify Self Similar characteristics over a range 
[1s, t2], where t2 depended on the measurement day but was always of the order 
of hundreds of seconds. We also estimated a Hurst parameter H close to 0.8, 
computed the peak factor at 1s (i.e. IDC(1s)) and finally computed the value of K 
in equation (2). If the hypothesis is made that the traffic has Self Similar characteristics 
also on lower timescales, then the computation of IDC(0.1s), i.e. the Peak 
Factor at 0.1s, is straightforward. This hypothesis is supported by a large body of 
empirical data analysed in several studies, see e.g. (Paxon, 1997). It would not make 
sense to extend it to timescales lower than 0.1s. 
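The extrapolation described above amounts to fixing K from one known point of the curve in (2) and evaluating the same power law at another timescale. A minimal sketch follows; the function name `extrapolate_idc` and the numeric values in the usage example are purely illustrative and are not taken from the measurements.

```python
def extrapolate_idc(idc_ref, t_ref, t, hurst):
    """Evaluate IDC(t) = K * t**(2H - 1), with K fixed by a known value IDC(t_ref)."""
    exponent = 2.0 * hurst - 1.0
    k = idc_ref / t_ref ** exponent   # constant K of equation (2)
    return k * t ** exponent


if __name__ == "__main__":
    # Illustrative values only: IDC(1s) = 10 and H = 0.8 give the scaling factor 0.1**0.6.
    print(extrapolate_idc(idc_ref=10.0, t_ref=1.0, t=0.1, hurst=0.8))
```

With H = 0.5 the exponent vanishes and the IDC is the same at every timescale; the closer H is to 1, the faster the IDC grows with t.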

In each simulation run we considered a simulated time span of at least 10^5 seconds 
(more than a day), in order to ensure the correct statistical behaviour of our 
sources, which have Long Range Dependent characteristics. 

In all the simulations, the bandwidth of the DBR VP was fixed at 155 Mbit/s (thus 
representing the case of a single OC3 link dedicated to this kind of traffic) and its 
buffer space was fixed at 477000 bytes, i.e. 9000 cells. This buffer space value is 
quite large (even if not unrealistic for today’s switches), and we chose it in order to 
better observe the effects of LRD traffic (the longer the buffer, the more relevant 
the impact of correlation properties in the traffic). 

In Figures 3 and 4 we present the losses vs. utilisation and the mean delay vs. 
utilisation plots of the unshaped case for the five parameterisations of Table 1. 
From Figure 3 it can be noted that the variation of the Hurst parameter (from 0.7 to 
0.9) does not significantly impact the loss ratio in the VP buffer. This is partly due 
to the finiteness of the buffer. In other simulations with larger buffer 
space (possibly unrealistic for an ATM switch), differences between the curves with 
different H values became more observable. However, it should not be deduced 
that Self Similar properties in the traffic do not affect performance (indeed, in 
Figure 3 there is no comparison with non Self Similar models). Rather, whenever 
heavy tailed ON OFF models are used in finite time simulations of finite buffer 
systems, the value of the Hurst parameter may not significantly affect the results. 
Mean delays are even less sensitive to Hurst parameter variation (see Figure 4). 

On the contrary, the variation of the peak factor parameter, which is directly 
related to the burstiness of the sources on low timescales, significantly affects 
performance, and should therefore receive careful consideration when setting 
model parameters for engineering purposes. 





Figure 3 - Losses vs. utilisation plots for the five source parameterisations 
considered - all sources are unshaped. 




Figure 4 - Mean delay in multiplexer vs. utilisation plots for the five source 
parameterisations considered - all sources are unshaped. 

It can be recalled that in order to meet a QoS commitment on CLR for multiplexed 
DBR or SBR ATM connections, four methods may be used. 

The first one is simply to reduce the shared VP utilisation, i.e. to move the working 
point of curves like the ones of Figure 3 towards the bottom-left corner. This is 
always possible and the multiplexing delay is even reduced, but it leads to lower 
revenue for the network operator. 

The second one is to increase the level of multiplexing, i.e. to increase both the VP 
bandwidth and the number of admitted sources. However, this is only possible if 
there are enough flows to multiplex, and in any case the VP bandwidth cannot 
exceed that of the physical link. For some analytical considerations on the 
benefits of multiplexing with Fractional Brownian Motion traffic, see (Erramilli, 
1996). 

The third one is to increase the buffer space assigned to the VP. This always results 
in an increased multiplexing delay too, and there are cases where, due to traffic 
source characteristics, the required buffer growth can be unacceptable. This is often referred 
to as the “buffer ineffectiveness” for Long Range Dependent traffic. 

The fourth one is to perform shaping on the flows before multiplexing according to 
some traffic contract parameters, which should be carefully chosen. Assuming that 
the VP utilisation is left unchanged, this always results in the creation of a second 
delay component (due to the queuing in the shapers’ buffers), while the delay 
component due to the multiplexer is expected to be reduced. The effectiveness of 
such a method, depending on traffic source characteristics, has already been 
questioned in (Erramilli, 1996) and is further investigated in the remainder of this 
paper. 

5 EFFECTIVENESS OF SHAPING IN REDUCING LOSS RATIO 
AND SIDE EFFECTS ON DELAYS 

5.1 DBR shaping 

We start by considering the case of VC connections being shaped before multiplexing 
according to a DBR traffic contract: whenever the source has a peak whose 
intensity exceeds a given Peak Cell Rate, it is limited to that PCR and the excess 
work is buffered. 

In the first simulated case, we multiplexed as many sources belonging to the first 
parameterisation listed in Table 1 as necessary to push, in the unshaped case, the 
loss ratio above 10^-5 (QoS class 2 target). This required 23 sources and led to a VP 
utilisation of 0.81. 
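The utilisation figure can be checked directly from the mean rate in Table 1 and the VP bandwidth given in section 4; the same kind of arithmetic relates the buffer size in bytes to its size in cells. The short sketch below uses hypothetical names and assumes the nominal 155 Mbit/s figure and the standard 53-byte ATM cell.

```python
CELL_BYTES = 53               # size of one ATM cell, header included
VP_RATE_BIT_S = 155e6         # DBR VP bandwidth used in the simulations
MEAN_RATE_BYTE_S = 682000     # mean rate of every parameterisation in Table 1


def vp_utilisation(n_sources, mean_rate=MEAN_RATE_BYTE_S, vp_rate=VP_RATE_BIT_S):
    """Offered load divided by VP bandwidth (mean rate converted from bytes to bits)."""
    return n_sources * mean_rate * 8 / vp_rate


def buffer_in_cells(buffer_bytes, cell_bytes=CELL_BYTES):
    """Number of whole cells that fit in the given buffer space."""
    return buffer_bytes // cell_bytes


if __name__ == "__main__":
    print(round(vp_utilisation(23), 2))   # 23 sources at 682000 byte/s
    print(buffer_in_cells(477000))        # VP buffer space in cells
```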

While keeping the generated traffic the same, we then varied the PCR of the 
shapers in order to evaluate their effectiveness in reducing the loss ratio below the 
QoS class 2 target. 

Results are reported in Figure 5 (for the moment, refer to the “PAB” plot only). In 
Figure 5(a) there is the value of the loss ratio in the VP buffer vs. the ratio of mean 




